
Building The Perfect Fabric for AI: DriveNets Scheduled Ethernet

In this webinar, DriveNets and Edgecore explore the various approaches to creating high performance, high scale, AI-compatible networks. They outline the revolutionary Fabric Scheduled Ethernet (FSE) solution, the most proven architecture for building open, Ethernet-based, congestion-free fabrics for highly demanding AI clusters.


Full transcript

Welcome & Introductions

Alright. Good morning, good afternoon, and good evening everyone, depending on where you are around the world. Thank you for joining us today. This is Naomi Chen, product line manager at Accton Corporation, covering our brand business, Edgecore Networks. I'm based in the Bay Area, United States.
So feel free to ping me or drop me a message if you want to have a coffee chat, anything. For those of you joining our webinar series for the first time this year, welcome. We have partnered with industry leaders to host these sessions, specifically to discuss what we call the ecosystem of choice.
I'm sure everyone on this call is here because you are interested in having more flexibility in your network or IT infrastructure. So today I'm joined by Sunny Nan, director of AI networking at DriveNets, to explore the various approaches to creating high performance, high scale, AI-compatible networks.

The AI Era: Why Networks Matter Now

Before we dive into the details, I want to set the stage for why we are bringing these pieces together. For those of you who have been in the industry for long, you have likely noticed the 10-to-15-year cycles that have shaped our lives, from the PC to the internet to the cloud. And now I think we can all agree that today, in 2026, AI is no longer just a buzzword.
We're actively weaving it into the very fabric of our global infrastructure. In this new era, the network is no longer just a support system; because we scale up and scale out everything, it has become the heart of the engine. And this is where the ecosystem of choice becomes critical.

The Case for Open Networking

So at Edgecore we believe that open networking, or network disaggregation, is much more than just a technical term. It's the freedom to choose: it allows you to pick the best-in-class hardware (hopefully from us), the operating system, and the management layers for your network. You can optimize every piece of this equation to build a solution that is as resilient and as scalable as you need.
So why does choice matter? Is there still a benefit to a vertically integrated stack from a single vendor? Absolutely. But we have realized that when you disaggregate properly and correctly, you unlock tremendous value: you can bring together the best chips, the best hardware, and the best operating environment.
When you do it correctly, this whole process dramatically reduces your total cost of ownership. You pay only for the features you actually need, while gaining full control over your supply chain and the product lifecycle. And as your needs change, you don't have to re-architect your entire network.
You can simply swap the specific pieces you need, maintaining operational consistency because you are the one defining the architecture. We also see a massive boost in innovation speed. Just using LLMs as an example: the capability of AI agents has improved dramatically because multiple vendors are innovating simultaneously.
When focused teams put technology together without the burden of legacy baggage, the rate of innovation doubles, triples, or even compounds. So, have we done disaggregation before? Is disaggregation just a new term? No, we have done it in IT for years: back in the day, everything in the data center was IBM, and now we naturally use storage from one vendor and servers and software from very different vendors.
This shows that even with vertical integration available, the entire market ultimately shifts toward open ecosystems, which is why interoperability becomes so critical. We have proven we can bring multiple vendors together; the only challenge has been how well they stitch together.
That is exactly the problem this ecosystem solves. Let's look at it purely from the numbers perspective, using SONiC as the primary placeholder for this discussion, because SONiC is a kind of symbol of open networking. Forecasts predict that about 10% of the market is disaggregated around 2026, which is now, and that figure is expected to hit 20% within a year, in 2027.
So we're seeing a rapid adoption rate in the data center, and you are not alone in this journey. To make this work, we look at three layers of network disaggregation: at the very top layer is your management and orchestration; at the center layer is the network operating system, the NOS.
The bottom layer, where Edgecore spends a lot of our time, is the hardware itself. We work closely with all our partners to ensure their solutions play perfectly with our hardware. You don't have to make a science project out of this; you can simply deploy it and start receiving the value.
So this whole ecosystem is tested, assured, and ready to be stitched together for your specific application. We're here to help you navigate this transition, not just for day zero of deployment, but for the long-term lifecycle and operational consistency you need to thrive. With the stage set, let's explore how we can build your AI infrastructure together.

Transition to DriveNets

Our speaker, Sunny, serves as director of AI networking at DriveNets. He will show us how the DriveNets operating system, running on Edgecore switching platforms, delivers the AI performance you want at the budget you need. So Sunny, over to you.
Thank you very much, Naomi. I will share the presentation in a second.

DriveNets Introduction & Agenda

Okay. So again, thank you very much, Naomi. As you mentioned, I'm Sunny from DriveNets, and today we'll talk a little bit about building the perfect fabric for AI. Before I start, I want to share a funny anecdote: today, when I told my daughter that I'm doing a webinar, she asked what it's about.
I said it's about AI fabric. She asked what that means, and when I said "backend," suddenly her eyes lit up: oh, backhand, I know about backhand; backhand is like Roger Federer's one-handed backhand. I said no, it's good that you know about that, but it's a totally different story. Although, same as Federer's, I think our backend solution is one of the best.
Again, it was funny to me that "backend" was something familiar to her. So, the agenda today: I'll talk a little bit about DriveNets, in case you're not familiar with us; some of the challenges of the AI era from the networking and cluster perspective; and then dive into the different networking options that exist in the industry today.
Then, in more detail, the DriveNets solution and all its layers. Similar to Naomi's three layers of networking, the hardware, the operating system, and the orchestration, we have a full solution that we would like to present and whose capabilities we will show you. We'll also show some use cases of key customers, the challenges they had, and the reasons they chose DriveNets. And at the end, we'll have a Q&A session.

About DriveNets

So, a little bit about DriveNets. DriveNets was founded in 2016, so we are celebrating 10 years. Last year we had massive growth, both in revenue and in headcount: we grew to more than 650 people, about 80% of them in R&D.
We are cashflow positive, and we're still a private company. Historically we have had huge success: we started by offering service providers a routing solution for their networks, and today we are part of AT&T's core network; most of their core network runs on DriveNets.
Same with Comcast, and we have other customers around the globe, in Europe and in Asia, all of them using DriveNets technology for different parts of their network. But again, that is the service provider business. What we want to talk about today is a different product line: our DriveNets AI Fabric product line.
In this product line we have success with hyperscalers and with some LLM vendors. The most typical customer would be a neocloud, a new provider of AI services, or a large enterprise using AI; any customer building an on-prem AI cluster will need some kind of networking solution.
From the partnership point of view, we partner with multiple leaders in the industry: different optics vendors; ODMs such as Accton and Edgecore; and IC vendors, mainly Broadcom, whose technology we use, but not only theirs. From the standards point of view, we actively participate in and influence the OCP, the Open Compute Project; TIP, the Telecom Infra Project; and the UEC, the Ultra Ethernet Consortium.
And through them we are advocating for Ethernet and promoting the open solution.

DriveNets AI Fabric Solution

So, from the product line point of view, as I mentioned, we have two product lines. In the service provider product line we offer a full network transformation, moving from legacy networks, from multiple networks, to a single network. We guide the service provider through this process, but again, I don't want to spend too much time on that.
You can read about it on our website. Today's focus is our AI fabric solution, where we offer not only a networking solution but a full stack solution, with different layers that are more cluster focused, because we understood that customers don't ask for a network. They want a fully operating, very efficient, high performance cluster.
Basically, our solution is the highest performance one; I will show you some results. We are offering an end-to-end solution to our customers, and you can look at it as a turnkey alternative to the NVIDIA solution: where NVIDIA comes and offers a full solution, today at DriveNets we are offering customers a full turnkey solution
based on different GPUs; I'll touch on it in a second. One of the unique points is that we have our own operating system that can serve different applications, different requirements, and different networking architectures, all of them supported by the same single operating system. And again, a very proven operating system:
actually, the same one that has been serving service providers for the last 10 years is serving our AI infrastructure as well. So, a full stack solution. How does it look? Let's look at the bottom: you can see the network platforms. These are the switches and the routers, which for our AI solution come from Edgecore/Accton.
Both the routers and the switches that we're using. On top of that, you can see the backend and frontend applications: the different scale-up, scale-out, and scale-across to remote locations, and the front end. On top of that sits the operating system, and we have the ACO, the AI Cluster Orchestrator.
This is a tool that I will present in detail; it provides additional services on top of just the network, bringing the network, the cluster, and the GPUs' performance together. We also have the DIS, the DriveNets Infrastructure Services: a team of experts that can help customers at the different stages, from planning and installation up to performance benchmarking, handover, and training, which I will touch on in detail.

AI Networking Challenges

So when we look at the challenges, let's look at the diagram. From what I've seen in the last two years talking to customers, most customers who want to install AI will go to NVIDIA and buy a full solution from NVIDIA.
It's safe, it's proven, and it's working, so that's a good choice: AI is complicated, you are not really familiar with it, and that is a very safe option to choose. And then you get the NVIDIA solution, and for networking that means NVIDIA InfiniBand, which again is a proprietary solution sold by NVIDIA.
So that's the first step. Then customers start to see some of the challenges of InfiniBand, and if they want to move to something else, they have some concerns. First of all, performance: they want to make sure that if they move to a different solution, it still gives them InfiniBand-level performance, the job completion time, the latency, the fact that it's lossless, the RDMA functionality, and ultimately the GPU utilization.
So they want to make sure that whatever they move to has InfiniBand-level performance. They also have an issue with vendor lock-in: when you buy InfiniBand, you are locked to NVIDIA. You cannot really control the lead time; if you don't have equipment, you need to wait for it.
You don't have the option to buy it from someone else, so you actually have a very strong lock-in on the GPUs, on the NICs, and on the switches themselves. And if you want to move to something else, you don't want to move into another vendor lock-in: moving from NVIDIA to another networking company, say Cisco or Arista, means a new vendor lock-in where you need to buy the switches from them.
It's a monolithic solution: the hardware, the software, the NICs, everything needs to come from them, and there are different licensing options. Deployment today with InfiniBand is complicated, and moving to another flavor of Ethernet is also complicated; it requires very extensive configuration during installation.
That's also a challenge they want to overcome: they want to move to Ethernet, but they want it to be easy to deploy and easy to manage. Another challenge comes from the application: when they build a cluster, they realize it's not being used by one team or one customer. It's used by different teams, or by different customers if it's a neocloud, a cloud provider.
So they want to make sure there is good tenant isolation without any penalty or overhead; that's also a requirement they need to meet. Then there is cross-site connectivity, going to a different location if you have some limitation on power, for example; I think this is the main challenge that they have.
They don't have enough power in one data center to build the AI cluster they would like, so they build it in two different geographical locations. They need to connect those locations and still keep the high performance, still keep the network lossless. This is also a challenge they are looking to solve. And at the end of the day, cost is also a major player.
When you buy InfiniBand, you pay a premium. Customers said: okay, I paid the premium once, but then I have the yearly maintenance contract, which is 20% of the initial cost. So the initial cost was very high, and now the maintenance is also very high. I want to get rid of that and move to a more attractive, more cost-effective solution.
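To make that maintenance math concrete, here is a rough back-of-the-envelope sketch. Only the 20% yearly rate comes from the talk; the capex figure and the five-year horizon are made-up assumptions for illustration.

```python
# Back-of-the-envelope illustration of the "20% yearly maintenance" point.
# The 20% rate is from the talk; the $10M capex and 5-year horizon are
# arbitrary assumptions.
capex = 10_000_000          # hypothetical initial InfiniBand spend, USD
maintenance_rate = 0.20     # yearly support contract, fraction of capex
years = 5

maintenance_total = capex * maintenance_rate * years
tco = capex + maintenance_total
print(f"Maintenance over {years} years: ${maintenance_total:,.0f}")
print(f"Total {years}-year cost: ${tco:,.0f} ({tco / capex:.1f}x initial spend)")
```

At five years of such a contract, the support fees alone equal the original purchase price.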

Networking Options Compared

So what are the alternatives? Again, we talked about InfiniBand: very good performance, scalable, efficient; NVIDIA is doing a great job there. But look at the disadvantages: it's proprietary, everything needs to come from NVIDIA, and the flexibility is limited. It's tailored to a specific application, a specific workload, and when you need to make changes to your workload, you need NVIDIA to come over and walk you through the configuration. And it's complex:
every organization has good IT people who are familiar with Ethernet, but InfiniBand is something you have to learn; you need to be an InfiniBand expert, and a lot of enterprises don't have this type of expertise. Building an in-house team of InfiniBand experts is not a very wise thing to do in the long run.
So what are the alternatives when moving to Ethernet? Everybody is convinced about moving to Ethernet; even NVIDIA is moving to Ethernet, right? They are part of the Ultra Ethernet Consortium and they have an Ethernet-based product line. And as it looks, InfiniBand is in sunset mode,
and there will be no major InfiniBand releases after this year, so everybody is moving to Ethernet. When you look at Ethernet, you can have a single-chassis solution with more than 500 ports of 800 gig on the chassis, so it can be very efficient. The performance will be good as long as you stay within one chassis,
because you have the backplane and the architecture is non-blocking. But it's limited, limited to around 570 ports. If you want to grow your network and scale to a higher number, you have an issue: you are blocked. And again, it's still proprietary; you buy this chassis from a specific vendor.
It's a monolithic solution; everything comes from that vendor. So a chassis is not a good solution when you want to scale, and even today, when customers start with 256 GPUs, it's obvious they will need more, and the chassis will block their scalability. The other option is Ethernet Clos:
a leaf-spine architecture with two tiers of switches. The deployment is easy and it's very scalable: you have high-radix switches with many ports, and you can grow. The major drawback is performance. This is plain Ethernet, and Ethernet was not built for AI; it's a lossy technology.
Ethernet was built to tolerate losses in the network, which is fine for internet users, for mail users, but it's not a good fit for AI. The efficiency of the workload is very low, and that's why it is not widely used; we will show you some numbers. Customers evaluated standard Ethernet, and again, the performance is totally unacceptable.
The next evolution of Ethernet is endpoint scheduling using smart NICs. It's congestion-controlled Ethernet: you make the Ethernet network a bit smarter and, at the edges, you use smart NICs that actually manage the networking and react in case of congestion.
So the deployment is easy and it's still scalable, but the network performance is not there. Instead of avoiding congestion, you manage congestion: when congestion happens, the NICs and the network react. So the performance will not meet the InfiniBand level; even according to the UEC, and this approach is part of the UEC as well, it is described as sufficient performance.
And at the end of the day, you need those smart NICs, which are very power hungry and very expensive, and they are complex to manage: you need to deal with all those different buffers, ECN and PFC, and try to tune the buffers for different workloads. So it might be a good solution in specific cases, but it's not a good generic solution.
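To illustrate the "manage rather than avoid" point, here is a toy reactive congestion-control loop, a deliberately simplified caricature of an ECN-driven, DCQCN-style back-off. All thresholds and rates are arbitrary; real smart NICs expose many more knobs (PFC, buffer profiles, timers).

```python
# Caricature of endpoint-scheduled Ethernet: the queue must build up and
# cross an ECN threshold before the smart NIC reacts by cutting its rate,
# then slowly recovers. Congestion is reacted to, not avoided.
queue_depth = 0
ECN_THRESHOLD = 100          # packets
rate = 1.0                   # sender rate, fraction of line rate

for tick in range(10):
    queue_depth += int(60 * rate)           # arrivals this tick
    queue_depth = max(0, queue_depth - 40)  # fixed drain rate
    if queue_depth > ECN_THRESHOLD:         # congestion already happened...
        rate *= 0.5                         # ...now the sender backs off
    else:
        rate = min(1.0, rate + 0.05)        # additive recovery
    print(f"t={tick} queue={queue_depth:3d} rate={rate:.2f}")
```

Running it shows the oscillation being criticized here: the queue overshoots, the rate collapses, then creeps back up until the next overshoot.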

Fabric Scheduled Ethernet (FSE)

And if you want InfiniBand-level performance, you need to move to the next step. The next step, as we look at it, is the smartest option of Ethernet, the highest performance Ethernet for AI: the scheduled fabric. In this case the fabric itself is scheduled, the fabric is smart, so the network itself is smart.
It's a non-proprietary solution: you can use different vendors for the hardware and different vendors for the software. It's flexible, it's scalable, the efficiency is very high, and the network performance is high; I will dive into more detail to explain why.
The deployment is very simple, again because the network itself handles all the congestion and makes sure it doesn't happen. You don't need to configure different buffers and you don't need any telemetry; the fabric itself does the smart part. So at the end of the day, how does it work?
Look, instead of using InfiniBand as the backend solution connecting the GPUs, and standard Ethernet for storage, you can unify the backend and the storage networking into one high performance Ethernet. Not only will the GPUs enjoy it; the storage will enjoy it too.
So how do we do that? How do we turn standard Ethernet into a lossless, predictable network? If you look at the diagram, you can see it's still a leaf-spine topology: the leaf is the NCP, the packet forwarder, and the spine is the fabric switch. The connectivity to the GPUs and the storage can be 200, 400, or 800 gig,
and soon 1.6T. Now look at the bottom three points. Cell spraying is the first of the three major building blocks: the network itself takes the packets coming into the network, cuts them into small cells, and distributes the cells across the entire fabric.
So from one leaf, the cells are sent to all fabric switches, and from the fabric switches they go to the destination leaf. That makes sure there is no unbalanced connectivity in the network: we cannot have a situation where one link is 80% loaded and another is just 20%, because at all times the network is evenly loaded, so all the connections sit at 60%, or 80%, or whatever the load happens to be.
Because everything is cut into small cells and the cells are spread evenly, we get perfect load balancing without any configuration; it comes out of the box. The second point is credit-based scheduling with receiver-based credits: the receiving leaf provides credits to the sender,
making sure that if you send me data, I can handle it, I have enough bandwidth and enough capacity. So the known incast issue, where one destination suddenly gets more than 800 gig, cannot happen, because it is credit based, actually similar to InfiniBand, but on Ethernet. This is the second point that makes the network lossless.
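Here is a minimal sketch of these first two building blocks, cell spraying and receiver-based credits. The cell size, link count, packet sizes, and credit capacity are all made-up numbers; this is a conceptual toy, not the actual DriveNets data plane.

```python
# Toy model of (1) cell spraying - packets are cut into fixed-size cells and
# spread evenly over all fabric links - and (2) receiver-based credits - a
# sender only transmits cells the destination leaf has granted credit for.
from itertools import cycle

CELL_BYTES = 256
FABRIC_LINKS = 4

def spray(n_cells: int, link_load: dict) -> None:
    """Round-robin the cells of one packet over the fabric links."""
    links = cycle(range(FABRIC_LINKS))
    for _ in range(n_cells):
        link_load[next(links)] += CELL_BYTES

class Receiver:
    """Grants credits so a sender can never exceed what the egress port can
    absorb -- congestion (incast) is avoided rather than managed."""
    def __init__(self, capacity_cells: int):
        self.credits = capacity_cells

    def grant(self, requested_cells: int) -> int:
        granted = min(requested_cells, self.credits)
        self.credits -= granted
        return granted

link_load = {i: 0 for i in range(FABRIC_LINKS)}
rx = Receiver(capacity_cells=100)

for pkt_bytes in [4096, 9000, 1500, 9000]:       # arbitrary packet sizes
    cells = -(-pkt_bytes // CELL_BYTES)          # ceiling division
    spray(rx.grant(cells), link_load)            # send only granted cells

print(link_load)  # per-link byte counts come out nearly identical
```

The point of the printout is the even spread: no fabric link becomes a hot spot, regardless of packet sizes or flow patterns.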
And the third one is ingress VOQs, virtual output queues. We have multiple virtual output queues in each leaf router, so different GPUs or different tenants are isolated: one gets a certain VOQ, another gets a different VOQ. If one of them starts to overload the leaf router,
it will not block the others. You actually have strong separation and isolation between tenants, and this makes sure you avoid the so-called noisy neighbor phenomenon, where one user consumes more bandwidth than it is entitled to and interferes with all other users. With these multiple VOQs, we totally eliminate the noisy neighbor issue.
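A minimal sketch of the VOQ idea, assuming hypothetical tenant names and a simple round-robin scheduler; real hardware schedulers are far more elaborate.

```python
# One virtual output queue per tenant at the ingress leaf: a backlogged
# tenant cannot head-of-line block the others, because each VOQ is served
# independently.
from collections import deque

class IngressLeaf:
    def __init__(self, tenants):
        self.voqs = {t: deque() for t in tenants}   # one VOQ per tenant

    def enqueue(self, tenant, cell):
        self.voqs[tenant].append(cell)

    def schedule(self):
        """One round-robin pass: take at most one cell from each VOQ."""
        return [(t, q.popleft()) for t, q in self.voqs.items() if q]

leaf = IngressLeaf(["tenant_a", "tenant_b"])
for i in range(1000):                  # tenant_a floods the leaf...
    leaf.enqueue("tenant_a", f"a{i}")
leaf.enqueue("tenant_b", "b0")         # ...tenant_b sends a single cell

# tenant_b is served in the very first round despite tenant_a's backlog.
print(leaf.schedule())  # [('tenant_a', 'a0'), ('tenant_b', 'b0')]
```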
Just to summarize the technology, in case it's interesting: again, we are talking about the Edgecore boxes, and from the IC point of view we are using the Broadcom Jericho3 and Ramon3 ICs in the leaf and in the spine.
Another point provided by the Broadcom hardware itself is the third bullet in "how it works": fast failover. The network itself, in hardware, knows when there is a failure on one of the links and immediately updates the entire cluster that this link is down, so all the traffic goes through other links.
This is done in microseconds, because it is detected in hardware; you don't need an upper application layer to detect it, react, and run various protocols and algorithms, which takes much more time, and during that time you are actually losing data, which hurts the job completion time.
So this is one of the benefits we get from using the Broadcom ICs. At the end of the day, with Fabric Scheduled Ethernet, customers get predictable performance: you can plan for workloads and you get very good job completion time at high scale. The deployment is very fast, you don't need sophisticated tuning, and of course you can support multi-tenancy, which has recently become a major requirement.

Performance Results

We also have some results to share. The DriveNets results were measured in a live production cluster
installed in Europe, and the measurements were done by a third party. You can see the DriveNets results in blue. As I mentioned at the beginning, our customers' goal is InfiniBand-level performance: if you can be at the same level as InfiniBand, on Ethernet, that is already amazing.
But look at what we have seen in the graph: on the left side is the bus bandwidth, which is the throughput of the entire cluster, and on the bottom you see the different NCCL message sizes. This test was done on NVIDIA GPUs, and when you're using DriveNets FSE, you can see that our results are better.
And 6% is the average improvement across the different message sizes being used. Traditional Ethernet, of course, shows very low performance, again because of the losses, and NVIDIA Spectrum-X is a bit lower than InfiniBand. We are showing that we can get better performance than InfiniBand across the message sizes.
Here are more detailed results, DriveNets versus InfiniBand, and this time not the average: you can look at the different collective communications, and you can see that on all-reduce, for example, we are 2% better. Again, the goal is to be at least on par with InfiniBand, and you can see that we are even better than that.
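For readers who want to reproduce this kind of chart: "bus bandwidth" is the metric reported by the standard nccl-tests suite, where for all-reduce busbw = algbw × 2(n-1)/n and algbw is message size divided by elapsed time. The numbers below are made-up placeholders, not the measured results from the slide.

```python
# Bus-bandwidth arithmetic as defined by the nccl-tests documentation for
# the all-reduce collective.
def allreduce_busbw(message_bytes: float, elapsed_s: float, n_ranks: int) -> float:
    algbw = message_bytes / elapsed_s            # bytes/s seen by the caller
    return algbw * 2 * (n_ranks - 1) / n_ranks   # bytes/s crossing the wires

# Hypothetical example: a 1 GiB all-reduce over 64 ranks finishing in 25 ms.
bw = allreduce_busbw(1 << 30, 25e-3, 64)
print(f"bus bandwidth ~= {bw / 1e9:.1f} GB/s")   # ~84.6 GB/s
```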

Hardware: Edgecore + DriveNets

Okay, so those were the results. When we look at the cooperation with Edgecore, from the hardware point of view, from DriveNets' point of view, we are using both routers and switches: the leaf is the 5000 series and the spine is the 9000 series, based on Jericho as I mentioned.
And we have two flavors of leaf: one is the standard leaf, and the other comes with HBM, high bandwidth memory, which is used when you have scale-across. When you need to reach a different geographical location, you need better buffering to keep the lossless functionality over long-distance connections.
The switch itself is a 2U box with 30.4 Tbps: 18 ports going to the GPUs and 20 ports going to the fabric, both at 800 gig. And the fabric switch is a beast: it uses two Ramon3 ICs with a total throughput of 102.4 Tbps across 128 ports of 800 gig.
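The quoted capacities are simple port arithmetic, which is easy to sanity check:

```python
# Throughput = port count x port speed.
GBPS = 1e9

leaf_ports = 18 + 20                         # 18 to GPUs + 20 to the fabric
print(leaf_ports * 800 * GBPS / 1e12)        # 30.4 Tbps -> the 2U leaf box

fabric_ports = 128
print(fabric_ports * 800 * GBPS / 1e12)      # 102.4 Tbps -> the spine
```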

AI Cluster Orchestrator (ACO)

So we come to the additional layer that DriveNets offers: the ACO. This is an automated, end-to-end infrastructure orchestration service that helps the customer in the planning and deployment stages and in the validation and operation stages, all of it in one automated tool.
At the end of the day, we want our customers to have a simple, error-free process and to get the highest utilization out of the hardware they have acquired. Looking at it in more detail: the first tool we have is the fabric planner, which really helps. We have validated topologies for the networking part that we have tested in-house.
So the customer comes and says: I'm using AMD, or I'm using NVIDIA; which type of GPUs, how many GPUs; which architecture I would like to use, whether it's top-of-rack, top-of-rack optimized, or rail optimized; and you choose the cabling, optical or DAC.
And at the end of the day, you get a diagram, an architecture that is pre-validated, together with the BOM. So everything is fast, with no mistakes and a lot of confidence, because it was all pre-tested. That is the planning stage. After the planning stage, we have the provisioning engine tool.
This is part of our ACO, and it brings end-to-end automatic cluster configuration. It helps the customer do the bring-up and deploy the required drivers, even on the servers: if there is a specific workload you want to run, there are specific parameters you want to optimize, and the network needs to be tuned according to the customer's requirements.
So we do the cluster bring-up and smart topology discovery, meaning the tool takes the list of leafs and spines with their IP addresses, and then all the configuration is downloaded to the devices. At the end of the day it is zero-touch provisioning: everything is automated and very fast.
I will show you an example; these are some screenshots I took from the tool. On the left side you see the input: the IP addresses, the plan, the location, whether a device is a leaf and in which rail it sits. Then you push a button, and all the configuration and all the parameters are downloaded to the entire cluster, whether it's tens or hundreds of switches, in no time.
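Conceptually, the "push of a button" flow looks like the sketch below: a declarative inventory is turned into per-device configurations and pushed out. The inventory schema and the render/push helpers here are hypothetical illustrations; the actual ACO interfaces are not shown in this talk.

```python
# Sketch of a zero-touch provisioning loop: derive each device's config
# from its role and position, then push it to the whole cluster at once.
inventory = [
    {"name": "leaf-01",  "role": "leaf",  "rail": 0,    "mgmt_ip": "10.0.0.11"},
    {"name": "leaf-02",  "role": "leaf",  "rail": 1,    "mgmt_ip": "10.0.0.12"},
    {"name": "spine-01", "role": "spine", "rail": None, "mgmt_ip": "10.0.1.11"},
]

def render_config(device: dict) -> str:
    """Derive the device's configuration from its inventory entry."""
    lines = [f"hostname {device['name']}", f"role {device['role']}"]
    if device["rail"] is not None:
        lines.append(f"rail {device['rail']}")
    return "\n".join(lines)

def push(device: dict, config: str) -> None:
    # Stand-in for the real transport (NETCONF/gNMI/SSH); just prints here.
    print(f"--> {device['mgmt_ip']}\n{config}\n")

for dev in inventory:        # the "push of a button": loop over the cluster
    push(dev, render_config(dev))
```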
The second tool we have in the ACO is the benchmark engine, and this is very important; it goes beyond the networking. We have different tools that run in stages: for example, the RDMA runner, which tests the physical network
and verifies that every NIC is being used and optimized; if it's a 400 gig NIC, you should get at least 97-98% throughput on that NIC. Then we have the collective sweeper, which looks at the NCCL or RCCL performance and optimizes the network according
to the customer's requirements and our own optimizations, whether it's a training workload or an inference workload. And finally there is the top 500 benchmark, if a customer would like to know how their cluster performs compared to the top 500. This tool is meant to be used when you deploy a new cluster, when you onboard new tenants and want to make sure you keep the utilization and performance, and for periodic checkups.
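The per-NIC stage amounts to a pass/fail check against line rate, along these lines; the NIC names and measured numbers below are fabricated for illustration.

```python
# Verify each NIC sustains at least ~97% of its 400G line rate before
# moving on to collective-level benchmarks.
LINE_RATE_GBPS = 400
THRESHOLD = 0.97

measured_gbps = {            # would come from an RDMA bandwidth test
    "nic_0": 391.2,
    "nic_1": 389.7,
    "nic_2": 352.0,          # an under-performing NIC
}

for nic, gbps in measured_gbps.items():
    ratio = gbps / LINE_RATE_GBPS
    verdict = "OK" if ratio >= THRESHOLD else "FAIL (check optics/cabling/PCIe)"
    print(f"{nic}: {gbps:.1f} Gbps ({ratio:.1%}) {verdict}")
```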
The third part of the ACO is the ongoing, day-to-day management: the tool that collects telemetry and shows you the lifecycle, and not only for the networking but also for the GPUs themselves, the utilization percentages and the telemetry of the network. At the end of the day, this is the day-to-day management tool,
and based on our 10 years of experience, it is a very sophisticated yet simple-to-use tool with a lot of functionality. The DIS team, as I mentioned before, is the one that can help customers from the design phase through installation, rack and stack, configuration, validation, and benchmarking, and finally enablement and training.

DIS Team & Customer Results

All of these functionalities and support are part of the offering our customers get, and they are actually very happy with the results. Let me give a couple of examples of the DIS team's results. This is, for example, a customer using AMD who wanted to squeeze the performance out of their AMD inference cluster.
What we did: we started with a single node, just eight ports, and we optimized the drivers, the operating system parameters, and the BIOS configuration. The parameters you see here are the concurrency, how many users are using the cluster concurrently, and the TTFT,
the time to first token; the lower the time to first token, the better. As you can see, the blue line is the DriveNets solution compared to an industry-standard setup used with AMD. And again, this is before the networking part; this is on a single server.
The second stage was when we looked at multi-node; then the network comes into play. So on top of the GPU optimization and the server optimization, we did the networking optimization and used specific parameters on the network. Again, based on our knowledge, the tools are able to download the right configuration
for the specific workload. And at the end of the day, looking at the average, you get a 15% improvement in TTFT compared to the industry standard.
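For reference, TTFT is typically measured by timing the gap between issuing a request and receiving the first streamed token, roughly as below. The `stream_tokens` function is a stand-in for any streaming inference client; it is not part of the DriveNets tooling.

```python
# Measure time-to-first-token (TTFT) against a streaming endpoint.
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    """Placeholder for a real streaming LLM client."""
    time.sleep(0.12)              # simulated prefill / first-token latency
    yield "Hello"
    for tok in [",", " world"]:
        time.sleep(0.02)          # simulated inter-token latency
        yield tok

def measure_ttft(prompt: str) -> float:
    start = time.perf_counter()
    next(stream_tokens(prompt))   # block until the first token arrives
    return time.perf_counter() - start

print(f"TTFT: {measure_ttft('benchmark prompt') * 1000:.0f} ms")
```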

Customer Case Studies

From a case study point of view, I will not spend much time on this; if you're interested, you can look at it on our website. We have different types of customers. An enterprise customer, a biochemistry research institute, came to us and said: our first cluster was NVIDIA InfiniBand, and we want to move to Ethernet. They tested different vendors and decided that our solution was the best. They liked the performance and the fact that we have a unifying fabric and multi-site connectivity, which is something they plan to do.
From the hyperscaler and LLM point of view, I'll give one example. We did a POC with them; they tested different versions of Ethernet, both smart NIC Ethernet and FSE, and they saw a 30% improvement in job completion time compared to any other Ethernet option.
That is the reason they chose DriveNets. The last one is a neocloud. We met them last year, and the project moved very fast. They wanted to use Ethernet, they had chosen standard Ethernet, and they saw that the performance was not there, so we immediately did a POC.
They started with a single location in Europe, and now they're moving to multiple data centers in Europe this year. This is the way it looked, and just to emphasize: you can see the GPUs on the left side and the storage as part of the same network, and they were very happy with the results.
I have a short testimonial, a minute-and-a-half video from the customer. This is Tom, their CTO; I will let you listen to his testimonial for a minute. Hi, I'm Tom Sanfilippo. I'm the CTO at White Fiber. We're an AI cloud company focused on performance and scalability for AI applications of any size. One of our biggest challenges has been building GPU clusters around Ethernet technology, but without the potential bottlenecks of mixing workloads from multiple customers simultaneously.
In the past, we would build GPU clusters using separate networks for storage and compute. This makes it challenging to build very large networks: it increases the network complexity and the number of components we have to use. DriveNets' Fabric Scheduled Ethernet has allowed us to converge our GPU and storage workloads onto one fabric.
This is something we couldn't do with other technologies. It has provided great efficiency in terms of network throughput, and it enables us to support multi-tenancy better by providing isolation and fair queuing across the entire network infrastructure. We are looking to expand clusters using the DriveNets technology and to apply it in new areas as well, including cross-data-center connectivity, cluster-to-cluster connectivity, and building larger super clusters.
The DriveNets Infrastructure Services team was also a really big help in getting our POC built rapidly and proving out the technology, so we could make the decision to move to the next step and bring that environment into production. They did an amazing job; it's been a real pleasure working with them.

Key Use Cases & Advantages

Yeah, so again, it was an amazing project, and it's an ongoing activity with White Fiber. I would also like to spend a few seconds on some of the key use cases where the technology shines and gives unique advantages over any other Ethernet: multi-tenancy, unified fabric, scale-across, and the chassis alternative.
We touched on it a little already, but in multi-tenancy the requirement is strong isolation between tenants, good load balancing between tenants, and doing all of that without overhead. You can get this with other technologies using encapsulation, VLANs, or various tunneling technologies, but that gives you overhead:
there is a lot of penalty to pay in throughput. In FSE you don't have that, and that's very important in multi-tenancy; this is the reason customers testing our technology are so happy when they have multiple tenants in the network. Unified fabric, we talked about it: unifying the compute and the storage.
Traditionally, when customers talk about unifying the network on Ethernet, they tell me: oh, it's complex, because we need to manage it, we need to separate the different kinds of traffic. In traditional Ethernet you need a lot of configuration and a lot of effort; in FSE, with the multiple VOQs,
it comes out of the box. So it's actually the other way around: it makes their life much simpler. Scale-across, we talked about it too, and other solutions talk about having some kind of scale-across, but the fact that we use the Jericho IC with HBM is the best way to make sure you keep the lossless functionality.
You have HBM, so you have a buffering option: if the connection between the sites is jittery, you are able to absorb it, whereas with a software solution you cannot really absorb those bursts of traffic. The best way to do it is in hardware, with HBM, and our solution has it; standard Ethernet and InfiniBand don't have this HBM.
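A worked example of why deep (HBM) buffers matter for scale-across: to keep a long-haul link lossless, the hardware must absorb roughly bandwidth × round-trip time of in-flight data while flow control takes effect. The distances below are illustrative.

```python
# Buffer depth needed to absorb in-flight data on a lossless 800G link,
# assuming ~200,000 km/s propagation speed in fiber.
LINK_GBPS = 800

for km in (10, 100, 1000):
    rtt_s = 2 * km / 200_000                     # round trip, seconds
    inflight_bytes = LINK_GBPS * 1e9 / 8 * rtt_s
    print(f"{km:>5} km: RTT {rtt_s * 1e3:.2f} ms -> "
          f"buffer >= {inflight_bytes / 1e6:.0f} MB per link")
```

At 100 km that is already on the order of 100 MB per 800G link, far beyond typical on-die switch buffers, which is why the HBM-equipped leaf exists.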
And there is also the chassis alternative. If somebody wants massive scale, they want high-radix switches. DriveNets is known for disaggregation, meaning you can take the boxes and cluster them into a chassis-like solution, and in this solution the DriveNets scheduled fabric is not counted as an Ethernet hop:
it's actually a cell-based network. So you can still have a leaf-and-spine topology, still with two hops, and reach a massive scale of GPUs. You can use standard Tomahawk boxes, also coming from Edgecore, and on top of them the DriveNets solution, and it will look like a chassis.
So you can use multiple clusters of those; this is the chassis alternative. Looking at the advantages, again, we talked about the performance and the multi-tenancy. What's new here is the openness, and I want to emphasize it: when you're using DriveNets, you avoid vendor lock-in. You can use any GPU, you can have a mix of GPUs, and you can use different optics.
You can buy them yourself, or you can ask DriveNets services to buy them for you; we have our own relationships and our own inventory. But at the end of the day, you are not locked in, and you can use whatever is available, whatever is certified in your organization. So it is very important that, on top of all its benefits, this solution is open.

Summary

So to summarize: DriveNets FSE is a mixture; it's the best of both InfiniBand and Ethernet. From InfiniBand we take the strong point of performance, without the other limitations; from Ethernet we take the scalability, the cost attractiveness, and the openness.
And at the end of the day, DriveNets FSE provides you the best solution based on Ethernet. So this is the quote: the highest performance Ethernet fabric for AI workloads, based on Fabric Scheduled Ethernet, with minimal time to first token. And again, DriveNets is not only a networking solution for AI; we provide the full stack solution, including the networking and also advanced tools for cluster management.