AMD MemCon Interview Transcript

Allyson Klein: Welcome to the Tech Arena. My name is Allyson Klein, and I'm delighted today to be joined by Mahesh Wagh, Senior Fellow for Server System Architecture at AMD and co-chair of the CXL Technical Task Force. Welcome, Mahesh.

Mahesh Wagh: Glad to be here, Allyson. Looking forward to talking with you.

Allyson Klein: I have so many topics to ask you about today, but why don't we get started with a statement about data centers. Data centers are at the center of innovation for everything from new breakthroughs in AI to new digital services redefining industries. Yet we've relied on a consistent definition of data center compute for decades, with architecture defined by rack-based pizza box servers. Why has the industry stayed true to this architecture for so long?

Mahesh Wagh: Yeah, that's a very good question. If you look at where the industry is going from a data center perspective, and at the use cases, the industry is looking at the best way to innovate on existing platforms and how to bring incremental value, right? So if you look at all of those things in terms of what gives the best return on investment, it's usually the incremental technologies that you build. Anywhere you find an opportunity to recoup investment and build incremental technologies, that's where it kind of takes off. From that perspective, the data center is enabling a lot of businesses to transform onto data center servers. So within the industry you're asking: how do we bring in all of these new applications with an incremental approach? And you're innovating within that space as well. So don't get me wrong, but when you look at innovation and the best return of value you get on innovation, it drives you toward incremental technologies.

And when you think about something incremental, it's building on top of what you already have. So that's what we tend to see within the industry.

Allyson Klein: Now, CXL is the topic for today, and CXL has been introduced to data center platforms; AMD introduced it with Genoa. Why is this so critical a technology, and what does it change in terms of what you can do with data?

Mahesh Wagh: CXL builds on top of PCI Express. As we all know, PCI Express has been there for more than two decades and is going very, very strong, right? From an interconnect or IO perspective, it's giving you a tremendous amount of bandwidth and is on a great cadence to provide capabilities, right?

What CXL brings, at the first level, is new use cases and new usage models on top of what exists today on the PCI Express infrastructure. What is it bringing to the ecosystem? It's bringing new use cases that require a cache-coherent interface, and providing opportunities to innovate on memory technology, right?

So that is what it is bringing. Today it has established itself as an industry-supported cache-coherent interface, defined by the consortium, that works on an existing technology, which is PCI Express. Now, why is it a game changer? It's doing two things fundamentally. First, from a consortium perspective, it's bringing all of the compute vendors, all of the memory vendors, the data center and enterprise companies that are producing solutions, and the application developers into one common place to address the emerging requirements of the market, right? So that's great; we have convergence there. From a capabilities perspective, it's providing you the cache-coherent interface and a memory interface. All of the things that applications could take advantage of with cache coherence from a CPU core perspective, you're now providing those same capabilities for accelerators. As for memory technology, the memory controller was always integrated within the CPU, so anything related to memory technology would go through the CPU. What CXL is enabling are innovative solutions where the memory controller sits outside of the CPU, connected with CXL.

So in a nutshell, why is it changing the game? It is providing the opportunity to innovate on an existing infrastructure, and that is big, right? And an opportunity to innovate for different reasons: either folks are looking for differentiated, value-added products, or people are looking at building products that would provide better TCO compared to existing solutions.

So as a result of that, right, the opportunity is significant to both innovate and bring value on a platform, and that in my mind is what is going to make it a game changer.

Allyson Klein: Now, I mentioned that Genoa does support CXL, and it is the first platform from AMD to support it. How did you decide to deliver CXL at this time, and what are the specifics behind your support?

Mahesh Wagh: So, when we looked at it, there were these two aspects that we talked about: accelerator attach and memory attach. If you look into the ecosystem, between the two there is a significant amount of pull toward the memory attach part, right? So in terms of getting to market and bringing this into the product, what AMD thought about is: what are all the key features that you need to enable memory expansion? So from that perspective,

Allyson Klein: Mm-hmm.

Mahesh Wagh: with CXL 1.1 and with the 4th Gen AMD EPYC processor, we wanted to first address system flexibility, which is: can you provide the biggest configurability and flexibility to the system vendors? In which case you could decide to put, say, a high-bandwidth memory expansion device behind a single CXL port.

Or you could decide to bifurcate the port. Those are the capabilities that we provided from a system flexibility perspective. From a media perspective, CXL is by definition media-agnostic. So when we were looking at what we can provide, we have solutions where the media type can be either DDR5 or DDR4.
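The port bifurcation Mahesh mentions can be sketched in a few lines. This is an illustrative model only, not AMD's actual configuration logic: it assumes a port is split into equal-width links by repeated halving (x16 into 2x8 into 4x4), which is the typical menu a platform exposes.

```python
def bifurcations(width: int) -> list[list[int]]:
    """Enumerate ways an x<width> port could be split into equal-width
    links by repeated halving, down to x4 links.
    Illustrative sketch; real platforms expose a fixed set of options."""
    options = []
    links, w = 1, width
    while w >= 4:
        options.append([w] * links)  # e.g. [16], then [8, 8], then [4, 4, 4, 4]
        links, w = links * 2, w // 2
    return options

print(bifurcations(16))  # [[16], [8, 8], [4, 4, 4, 4]]
```

So a single x16 port could host one large memory expander, or be bifurcated to host multiple smaller devices, which is the flexibility being described.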

So that's giving a lot of TCO advantages to our end customers who are looking at recouping their investments. They're saying, okay, I want to do memory expansion: can I put my n-1 DIMMs, for example DDR4, behind this controller and provide a memory expansion solution that is very cost effective? Right? So we enabled that.

Allyson Klein: Mm-hmm.

Mahesh Wagh: Security continues to be a really important piece. So one of the differentiating things that we provide with Genoa is that all of AMD's Infinity Guard security solutions that are available today for direct-attached memory just extend seamlessly over CXL. And as we all know, security is a primary technical pillar for any solution that you want to deploy on a server; with Infinity Guard over CXL, you can just deploy seamlessly. So that's one of the great things. We also support tiering. When we bring in CXL devices, the key point is that their latency characteristics are different from direct-attached memory, and there have been a lot of developments in the ecosystem related to that, namely the understanding of non-uniform memory access, NUMA nodes. What CXL is really doing is bringing, for the very first time, the concept of a headless NUMA node into the ecosystem. And there are a lot of innovations in that space to first understand how tiered memory systems work, and then optimize for those tiered memory systems.
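To make the "headless NUMA node" idea concrete: a CXL memory expander typically shows up to the OS as a NUMA node that has memory but no CPUs. The sketch below uses made-up sample data standing in for what a tool like `numactl --hardware` would report; node IDs and sizes are hypothetical.

```python
# Hypothetical node table: two sockets with local DDR5, plus one
# CXL memory expander that appears as a CPU-less ("headless") node.
nodes = {
    0: {"cpus": list(range(0, 48)),  "mem_gb": 256},  # socket 0, local DRAM
    1: {"cpus": list(range(48, 96)), "mem_gb": 256},  # socket 1, local DRAM
    2: {"cpus": [],                  "mem_gb": 128},  # CXL expander: no CPUs
}

# A headless node contributes memory capacity without owning any cores.
headless = [n for n, d in nodes.items() if not d["cpus"] and d["mem_gb"] > 0]
print(headless)  # [2]
```

Tiering software treats such nodes as a farther, higher-latency tier behind the direct-attached DRAM nodes.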

So one of the things that we do on the AMD side is provide all the architectural and technical hooks on our CPU so that we can improve the performance of a tiered memory system. And finally, we have the ability to enable disaggregated memory systems as a proof of concept, so that we can build systems of the future that enable disaggregated memory, if you will.

And then finally, with the AMD EPYC processors, we were able to pull in some of the features that were defined in CXL 2.0. An example of that is persistent memory, so we could enable persistent memory support starting with Genoa. So when I look at what we're doing and what we're bringing with the 4th Gen AMD EPYC processors, we're really bringing these six different use cases that are really important for our customers, and bringing that in the very first generation of the processor is unprecedented for any technology development that I've seen. So we're very proud of it and of the way we brought it to the market.

Allyson Klein: Mahesh, you just described an amazing value proposition of new capabilities with CXL. What has the customer response been? I know that the large cloud providers are very deeply involved in the consortium, but how has the broader market responded, and do you feel that enterprises by and large have really understood what is about to be available to them with their infrastructure?

Mahesh Wagh: Yeah, I think we're starting to see it, both from the large cloud providers' perspective and more broadly. One of the key things, pretty much across the board, is what we are doing from an AMD perspective: we're on the forefront of driving core scaling, right? We're bringing more cores, more capabilities into the system. To support those cores, to support the bandwidth requirement and the capacity requirement, there are certain constraints on what we could do based on the existing memory technologies. So at the very first go, CXL is addressing some of those shortcomings by providing a flexible opportunity to meet either the memory capacity or the memory bandwidth requirement by extending to CXL. Now, it has certain TCO advantages that you can benefit from, and those things aren't limited to large cloud providers. For example, in the enterprise, if you're deploying a large in-memory database system, you can start to take advantage of what CXL has to offer from a TCO perspective.

If your applications target high performance computing, or are applications that require more bandwidth, CXL is a way to provide that bandwidth at an effective cost. And one of the things that we're going to see as we deploy more CXL is that you'll be able to look at your applications and profile them in terms of their performance requirements.

And once you understand that, there will be a set of applications that can move. Some preliminary results indicate that 25 to 30% of applications are not latency sensitive. So if you can map those applications onto CXL, it allows you to deploy a solution where your most performance-sensitive applications target direct-attached memory, and the other applications target this other tier.
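The placement decision described here can be sketched as a simple profiling-then-mapping step. Everything below is hypothetical for illustration: the workload names and the latency-sensitivity flags stand in for what real profiling would measure.

```python
# Hypothetical profiling results: which workloads tolerate the extra
# latency of CXL-attached memory, and which need direct-attached DRAM.
workloads = {
    "oltp-db":      {"latency_sensitive": True},
    "batch-etl":    {"latency_sensitive": False},
    "ml-embedding": {"latency_sensitive": False},
}

# Map each workload to a memory tier based on its profile.
placement = {
    name: ("dram" if w["latency_sensitive"] else "cxl")
    for name, w in workloads.items()
}
print(placement["batch-etl"])  # cxl
```

In practice this mapping would be enforced by NUMA policy (e.g. binding a process's memory to the DRAM nodes or to the headless CXL node), but the core idea is exactly this profile-driven split across tiers.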

Right. So it's starting to open up these sorts of discussions, both in the cloud as well as in the enterprise, where people will start to understand the value that this is bringing, and then see how they can make use of it for the applications that they're going to deploy.

Allyson Klein: You know, we're at MemCon, and memory is obviously central to CXL and what it can bring to the table. Let's take a step back for a second and ask the simple question: why is memory capacity important to applications, and what's driving that? Where do you see the near-term opportunity for CXL to really make a difference?

Mahesh Wagh: Yeah. I'll start with two fronts on that one. One I kind of addressed in the previous question, which was really the core scaling perspective, right? Even if you were to not change the applications that you have, if you look at the number of cores that we're adding, from a core scaling perspective we had to have a solution that can keep up with the bandwidth demand and the capacity demand to feed the cores.

And memory technology isn't necessarily keeping up with that. We have some constraints, either platform constraints, channel constraints, or memory technology constraints, that are not scaling at the rate that we're scaling the cores, right? So at the get-go, you need a solution that's providing you that flexibility.

The second one is from a memory capacity perspective. One of the things we've understood is that as applications improve their capabilities, the capacity needed for an application grows every year, right? So there's a demand for more memory capacity for a given application. And then with new use cases such as AI and ML, the embedding tables that you need, for example for recommendation engines and the like, are growing exponentially. So as we look at that growth, it's creating a demand for more capacity, right? And then how do you address capacity? You have all of those constraints that I talked about around memory scaling, and memory is a significant cost of a data center, right? So the price is also increasing. What are the ways you can optimize for that? CXL provides you the opportunity to innovate and bring solutions to the market that meet the application needs for growing capacity, as well as the system's needs to feed the cores in both capacity and bandwidth.

So that's what's driving it, and MemCon is the perfect place, because you've got all of the folks who are focused on memory technology coming together. I do expect a lot of traction on CXL, and a lot of talks related to CXL that are going to be at the center of the discussions at MemCon.

Allyson Klein: What are AMD's plans for leadership in this space moving forward, and how do you see the evolution of the technology in terms of deployment in the next few years?

Mahesh Wagh: With 4th Gen AMD EPYC, we're leading the space with a processor with very innovative capabilities, right? And we are really hitting on the six to seven different use cases that our customers are targeting. Some of them are more mature; others are in the development phase, right? But we see a very nice roadmap for how these features are going to come out. At the heart of that, in terms of leadership, what I keep telling all of the teams that I engage with is: on the forefront, we've got to prove that CXL is functional and performant, which means we start with memory expansion, direct-attached memory expansion with DDR4 and DDR5 memory.

And we are working with the entire ecosystem, with the controller vendors, on their architecture very closely to make sure that we can bring performant solutions to the market. It'll start with AMD EPYC and a lot of our partners, and we're seeing these solutions come to the market this year. We have a production CPU, and we're expecting production-level devices to be available in 2023. That is what will start this adoption of CXL. What follows is just building on top of these capabilities, right? You bring in direct-attached memory expansion, and then you extend the capability and work with the ecosystem to enable these tiered memory solutions.

And then you optimize tiered memory solutions with developments in the ecosystem to improve performance. Once that's established, we see that setting the stage for disaggregated memory, persistent memory, and lots of the use cases that follow. So that's how we see this: as a crawl, walk, run approach.

Start with direct-attached memory expansion, and then build on top of that. And it's going pretty well; I'm very happy with the sort of progress that we're seeing in the ecosystem. And, you know, it takes a village. It takes everybody, right? It takes CPU vendors, the ecosystem, the software development, all of it, to get together to lift this technology up.

This isn't just one player; it's an ecosystem that'll need to get together to drive it. And events like MemCon and other events are really important because they bring people together and drive the technology forward.

Allyson Klein: Mahesh, when you look at the consortium itself, you've released a 3.0 spec. That's going to take us, you know, through a few years at least before we start seeing 3.0 solutions at scale. What is next for the consortium in terms of making sure that this technology is adopted well and performs as you and the Technical Task Force expect?

Mahesh Wagh: Yeah. I think one of the things the question touches on is what is happening with 3.0 and how it came about, right? There were all of these use cases and interests that the ecosystem had, and requirements coming into the consortium, but we had to look at it and lay that out across spec versions as incremental development, right? So with CXL 1.1, you bring the key features in. With CXL 2.0, you provide some scalability and add some extensions for what didn't exist in 1.1, persistent memory as an example. And CXL 3.0 then finishes it by providing you the scaling factor for these capabilities.

The direction for the consortium is, now that we've defined that, to give it a little bit of space for all of these technologies to mature and the products to come into the market, and then start thinking about the next generation of CXL, 4.0. So we're going to see some amount of slowdown in terms of the next version of the spec.

And primarily it is so we're able to deploy solutions, get some feedback from what exists and what the experience has been, and then drive that forward. That doesn't mean the innovation will stop. We will continuously look at CXL 3.0 and beyond for key features that are really important and can't wait for the next generation; those can be brought in as ECNs, that is, engineering change notices, things like that.

But the whole direction is that, now that we've laid out what it looks like from an ecosystem perspective, it also helps you look at it and say: what is the end goal in terms of the overall scale-out capability? Where can you start, and then build on it? So it's set up for that crawl, walk, run approach.

We're still at the crawl stage from an ecosystem deployment perspective, but the vision is laid out. The path is there for the ecosystem to go drive it together.

Allyson Klein: That's fantastic. One final question for you. You've put out a lot of information, both on CXL as a technology and on AMD's plans. Where would you send folks for more information?

Mahesh Wagh: Most immediately, if anybody's listening to this and you happen to be at MemCon, just attend the sessions or reach out; that would be the first place. Outside of MemCon, the CXL Consortium is a good place; if people want to know more about it, they can reach out to the consortium.

The consortium does a fantastic job of releasing webinars, training materials, and tutorials for those who are either new to the technology or well entrenched in the technology but want to learn more, right? So all of that material is available, and there are periodic training sessions that the consortium runs.

If you want to find specific information about what AMD is doing, you can find me on LinkedIn, or, if you're coming in through a company, you can engage with your AMD rep, and they know how to connect you to the technical folks. So that would be the way to get connected with the technology.

Allyson Klein: Fantastic. Thank you so much for being with us today, Mahesh, and for giving us this great primer on CXL and how it's going to impact data center infrastructure. I can't wait to see more as we move forward.

Mahesh Wagh: Yeah, we're excited about it, and we are really, really excited to bring this out into the market, and happy to talk to as many people as possible in bringing this technology out. So, like I was saying, it takes a village; it requires all of us to get together to drive this forward. Thank you.

Allyson Klein: Thanks for being here.
