Rebecca Weekly Podcast Transcript
Allyson
Welcome to the Tech Arena. Today we are continuing our journey through cloud computing as we enter 2023. And my guest today is somebody who needs no introduction in the cloud space. She was just named one of Business Insider's Cloudverse 100 as a builder of the next generation of the Internet. She's also the VP of Hardware Systems Engineering at Cloudflare. Welcome to the program, Rebecca Weekly.
Rebecca
Thank you, Allyson. It's so good to see you.
Allyson
So, Rebecca, you recently made a big change in your career, moving from the land of semiconductors into overseeing infrastructure at one of the largest cloud service providers. Tell me about that transition and what it's like to be so close to the customer at Cloudflare.
Rebecca
Well, I think, honestly, it was the learning part of the journey that motivated me. I spent my whole career in semiconductors and in EDA tooling for semiconductor development, very much in the nitty gritty, and had the wonderful opportunity within Intel to work on the systems that we build for cloud service providers, holistically across compute and storage and networking solutions. But as much as I felt I could learn and grow in that domain, I felt like I could only go so far; there are certain things that you will only learn as you operate. Operating helps you understand what needs to happen in the products underneath what you're doing, versus being underneath and trying to orient upward, guessing that yes, they would want to use it in this way. Some people can do that. Maybe if I had my crystal ball a little more polished, I would have been that much better at it coming from the direction of bottom-up silicon to systems.
But when you are able to work and call up the director of the Radar team, or one of the other wonderful products and services that we have, and say, okay, what experience problems are you having? Their words are going to be something like, well, it seems really slow. But slow to a software person could be a million different things to a hardware person, many of which aren't actually even hardware. What is hardware could be the network design, the actual bandwidth of the network, redundancy factors, factors relating obviously to the performance of the computing element and the bandwidth into it. So there are 50,000 things that slow could mean, and when you are trying to reason about that from one small piece of it, an integrated compute device or whatever it might be that somebody's working on, it's an impossibly difficult problem to model effectively. When you're actually working with customers who have millions of CPUs operating at scale, in our case in over 275 different cities, that global distributed network has all sorts of effects that aren't going to show up in a node-level test; it's apples and oranges. So in this role, you take 'it's slow' as described and try to root-cause it.
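To make the ambiguity of "slow" concrete: the usual first step is pinning the complaint to a percentile, since a healthy median can hide a terrible tail. Here is a minimal sketch, with invented sample data, of how the same service can look fine at p50 and awful at p99:

```python
# Illustrative only: "slow" usually has to be pinned down as a percentile
# before anyone can root-cause it. Sample data and parameters are made up.
import random

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

# Simulated request latencies: a healthy median with a long tail,
# the pattern that makes "it seems really slow" so ambiguous.
random.seed(42)
latencies_ms = [random.lognormvariate(3.0, 0.6) for _ in range(10_000)]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p):8.1f} ms")
# The median can look fine while p99 is many times worse, and each tail
# cause (network path, redundancy failover, CPU saturation) needs its own fix.
```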
Allyson
Are there a few things, as you've made this transition, that were the big aha moments for you? Moments of, oh my gosh, I never even thought about that, or, oh my gosh, this is something that I need to feed back to the industry so we can work on it together?
Rebecca
I probably have those every week. At least in the first six months, I'm pretty sure I had them every week, and I will do a poor job summarizing all of them. But many aspects of supply are always underestimated in terms of the impact they have on folks. There are many times where we've evaluated something that looks really good on paper and in the lab, and when push comes to shove, we end up not actually deploying it because we can't really get it in the kind of scale we would need, or on the timeline to actually get that device through. Our supply chain is nine months, three quarters, based on lead times of various components, et cetera. So supply capability, and not just the individual pieces of silicon but the systems associated with them, is such a problem. And lots of vendors we work with are like, but look, my benchmark looks so good. Don't you want it tomorrow? Why aren't you buying tomorrow? And it's like, let me talk you through what it takes for us to actually acquire your CPU, or your XPU, which is even harder, by the way, than a CPU to actually get availability of. And the list goes on.
I mean, it may be QLC drives and their lead times, or DDR5 lead times. This is not an industry where everyone's leading edge of readiness aligns to any one vendor. And so there are a lot of moving parts in the world of hardware that really do mean, yes, performance matters, really matters, but you're going to end up making choices with the systems you can get if you're not working with partners who know how to deal with compliance issues, the ability to ship to different locations, all these other factors that I lump into supply and that are really impactful when operating a global service. So that's definitely one big bucket.
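A back-of-the-envelope illustration of the supply point: a system is gated by its slowest component, so one scarce part sets the deployment date regardless of how good the benchmarks look. All part names and lead times below are hypothetical:

```python
# A rough sketch of why supply gates deployment: the system is ready when
# its slowest component is, not when the benchmark looks good.
# All part names, lead times, and dates here are assumptions.
from datetime import date, timedelta

lead_times_weeks = {
    "CPU": 16,
    "XPU accelerator": 36,   # scarce parts can dominate the whole build
    "DDR5 DIMMs": 24,
    "QLC drives": 20,
    "NIC (OCP 3.0)": 12,
    "chassis + power": 10,
}
integration_weeks = 6  # assembly, firmware, validation (assumed)

gating_part = max(lead_times_weeks, key=lead_times_weeks.get)
total_weeks = lead_times_weeks[gating_part] + integration_weeks

order_date = date(2023, 1, 2)
ready = order_date + timedelta(weeks=total_weeks)
print(f"Gating component: {gating_part} ({lead_times_weeks[gating_part]} wk)")
print(f"Earliest rack-ready date: {ready}  (~{total_weeks / 4.33:.0f} months)")
```

Under these assumed numbers, the 36-week accelerator plus integration lands at roughly nine to ten months, consistent with the "nine months, three quarters" pipeline described above.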
Allyson
Performance matters, but the customer matters most, and you need to serve the customer.
Rebecca
And zero supply means you get a zero on performance. Sorry, I hate to say this to you, but you can have the best thing ever, and if I can't buy it, it doesn't actually exist. I've had that conversation so many times. And I think it's because there are so many new companies and newer players in the industry right now, which is so great; it is a golden age of silicon innovation. But they may have amazing architects coming up with really amazing pieces of silicon and didn't invest in the humans that are required to manage their supply chain, to manage their assets, to forecast correctly, to handle logistics and shipping. These are real things that a mature company has had to think through and that some of our newer players don't know yet. That's expected, but it's a big deal.
The other thing that I am also seeing with a lot of our partners is software readiness in the ecosystem, for similar reasons. This shows up in two different flavors, just to roughly bucket them. One is in the domain of building your own: because it's awesome and amazing, and because it looks so good on this thing, then of course you're going to want to use it. And that's insane, because we're a company that's been around for twelve years. We have huge investments, many of them built off of open source projects, but some of them from our own internal development, or branched quite a long time ago and not actually fully aligned to what is currently in an open source project. And we have security concerns. We operate in terms of compliance; there's FIPS certification on certain libraries. Anytime you diverge from the open source community, or at least from something we have inspection capabilities into for security reasons, into something that's all your own, you're never going to sell me. It's never going to happen. I mean, I shouldn't say never, never say never, but you've put up a barrier to entry where you had better be NVIDIA; you had better have that much more performance if you're forcing people to use things like CUDA.
And if there's any way in the world that I can use TensorFlow or PyTorch or anything else with a community behind it, I will. Because being locked in to one solution is illogical: it won't pass my security requirements, and it doesn't have the benefit of the community looking at it, inspecting it, banging on it, making it better. So that's one flavor of software readiness where it's like, I get it, it's easier if you control it, but it really means I can't ever use you. Which is, more or less, the set of challenges you're setting this interview up with.
Allyson
You're also a member of and deeply involved in OCP, and you talked about open source. What is the role of standard configurations in addressing the former challenge? And can standard configurations help with supply?
Rebecca
Absolutely. Something we spend a lot of time talking about at the Open Compute Project is that open software is not the same as open hardware. And there's a reason for that, right? Open source software projects live in a GitHub repo. Everyone can access them, everyone can contribute, and they're fully inspectable. You're able to engage in that way, and then people just say, hey, I certified that version of the Linux kernel versus this one, and it works in that fashion. With hardware, you own an asset at the end of the process, and somebody has to produce that asset for you. Whether it's a server, whether it's a NIC card, there's usually a combination of several integrated chips and several different components coming together. And whether you're using an ODM or an OEM, somebody has to develop firmware and BIOS solutions, et cetera, on top of that box to give you a finished, functional part. And that's true whether it's a server or whether it's a white box.
So what we do at Open Compute is drive standards to help ensure that as many components as possible are interoperable across different vendors. It doesn't mean that the net end server is an open source thing that anybody could build on and anything could happen to, in the same way that an open source project would be. People's IP for their integrated chips and for their server design is still their own IP, and they have that right. But we create open standards around the interconnect points so that we can ensure that if you buy a DC-SCM or an OCP 3.0 NIC card from any number of vendors, it's going to plug into the PCIe slot the exact same way that anybody else's OCP 3.0 card plugs in. And you can get different form factors of a card compliant to that specific design so that you can accommodate a half-width board, 1U, 2U, whatever is right for your specific environment. So it is a little different. We do that through a process of contributions, defining the subset that will make interoperability possible while still enabling people to own their own IP.
And honestly, I believe that is important for three reasons. One, as a consumer, and this goes to your supply comment and question earlier, I need to know that if I choose to validate a NIC, I can get another supplier of that NIC if my current one isn't available. I'm still going to test it before I put a new vendor's version of it out there, but in general it's going to be a very short turnaround time to validate a different version of an OCP 3.0 NIC, because they're all compliant to a standard. So that helps with sourcing capabilities as a user. It helps with supply assurance, and it helps with validation timelines, because everything is faster when I know it's going to plug into my box.
Now, even from an ecosystem perspective, I would argue this is better for the IT ecosystem providers, because they're not making big investments just to move a NIC card around, right? And every person's environment often is different. Some people like to have their management network directly on their NIC card, with a different one-gig network interface versus their standard interface for their overall consumer network. Some people like to have a separate switch at the ToR (top-of-rack) level. Everyone's different in how they want to do their systems design for their reliability concerns and challenges. If you can standardize subcomponents so things are more aligned, then we get a much easier process as developers and manufacturers, ODMs and OEMs, in producing widgets that people can adopt quickly, without spending a lot of time in redesign.
And then the last community that we serve is obviously silicon providers as well. For those folks, knowing that new players can come into the market, use a form factor that's been standardized, be adopted, and have a supply chain ready for them faster is a real advantage. So ideally Open Compute helps in all of those ways when it's a healthy, vibrant community, and I believe OCP is an amazing organization for this. And the last thing I would add is that it's not just about supply assurance, it's also about sustainability, especially as we're seeing more bifurcation of options in the market. If I have to do a new full motherboard design for every CPU vendor, every DPU, IPU, XPU, it's a huge statement of lock-in. It's prohibitively expensive and it's horrible for the environment. So if I can take standard building blocks, try things in my lab, swap them out, and validate whether this actually makes sense for us before going big on a full server design, that makes it faster to adopt new technology, more interesting for new companies to get into, and a lot less e-waste in the world. I mean, about 70% of a server doesn't need to be redesigned gen on gen to get performance efficiencies. And that 70% is e-waste that, through modularity initiatives, doesn't have to happen. We're going to have to build new servers for capacity needs, but there's no reason why we can't get to a mindset where we swap the CPU and the memory and everything else keeps going for ten years, 15 years, instead of regenerating the fans and the CPLD and the BMC every single generation. Why? What's really changed in those devices?
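The modularity arithmetic is worth making explicit. A rough sketch, with an assumed server mass and refresh cadence, of how much e-waste carrying over that reusable 70% avoids:

```python
# Rough arithmetic on the modularity point: if ~70% of a server
# (chassis, fans, CPLD, BMC, cabling) can carry over between generations,
# how much e-waste does modular reuse avoid over a decade? Weights assumed.
server_mass_kg = 25.0        # assumed mass of a typical 1U/2U server
reusable_fraction = 0.70     # portion that doesn't need a gen-on-gen redesign
refresh_cycles = 4           # e.g. CPU/memory swaps over ~10-15 years

full_replacement_waste = server_mass_kg * refresh_cycles
modular_waste = server_mass_kg * (1 - reusable_fraction) * refresh_cycles

print(f"Full-replacement e-waste per server: {full_replacement_waste:.0f} kg")
print(f"Modular-swap e-waste per server:     {modular_waste:.0f} kg")
print(f"Avoided: {full_replacement_waste - modular_waste:.0f} kg "
      f"({reusable_fraction:.0%} of each refresh)")
```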
Allyson
That's a great point, and so poignant, and it gets me into my next question. Sustainability has become a bigger issue; not that it wasn't always an issue, but rising energy prices have put a 10x focus on it. What have you seen in terms of the shift in your attention to performance efficiency? Has there been a shift, or were you already there? And are customers asking about performance efficiency as well?
Rebecca
So I would argue that, given that we run a global network where we've been subject to the fun challenges of pricing in all sorts of different global markets, we've always been very sensitive to performance efficiency. We were experimenting with and testing the Amberwing solution, which was one of the very first ARM-based server solutions, back in 2015, to see if there might be alternate options out there that could be more efficient. So Cloudflare has a strong history of trying to make sure we are being as efficient as possible in serving the Internet. And that includes looking at accelerator solutions, looking at all sorts of options, just because we've always been exposed. We care about eyeball responsiveness, and so you're going to build in places that do not have a good PUE: high humidity, expensive power sources, all the time.
Now everybody is on this bandwagon, which is actually in some ways very nice, because it means there's so much more focus and interoperability and core capacity out there. I think what's really changed, from what I've seen, has actually been in the software ecosystem, because the number one metric for software developers for the vast majority of the time I've been on this earth has been efficiency of development time: the agility of the development, the agility of the team. And I'm starting to see, in the software ecosystem, people trying to figure out, is this an efficient way of doing it? Are we being logical? And that's a huge shift. To be fair, everyone thought that way in the 60s, but it was because they had no actual memory to use and had to optimize all of their time sharing. In normal programming since then, it's been about developer agility. And now people are starting to really look at developer efficiency, code efficiency. And that, I think, is critical, because if you go look at the GHG Protocol numbers, they'll tell you that of the roughly 60 tons of carbon associated with a server over its life, 90% of the emissions come from operating it. So we can get as good as we want in the supply chain, in reuse, in recycling practices, and it's still only going to move the needle on 10%. The operations are where, again, we can do some great things with different architectures. But if we have crappy code on there that is sitting and interrupting the processor 24/7, it's not going to be efficient in serving its overall objectives.
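Plugging the quoted figures into quick arithmetic shows why the operational side dominates. A small sketch (the improvement percentages are illustrative assumptions):

```python
# Illustrative math on the GHG-protocol point: if ~90% of a server's
# lifetime emissions come from operating it, supply-chain wins are capped.
# The lifetime total below just reuses the figure quoted above.
lifetime_carbon_t = 60.0       # total lifetime tCO2e quoted for a server
operational_share = 0.90       # share from operations, per the discussion

embodied_t = lifetime_carbon_t * (1 - operational_share)
operational_t = lifetime_carbon_t * operational_share

# Even a heroic 50% cut in embodied carbon moves the total only ~5%...
embodied_win = 0.5 * embodied_t
# ...while a 20% operational gain (better code, better utilization) moves ~18%.
operational_win = 0.2 * operational_t

print(f"Embodied: {embodied_t:.0f} t, operational: {operational_t:.0f} t")
print(f"50% embodied cut saves {embodied_win / lifetime_carbon_t:.0%} of total")
print(f"20% operational cut saves {operational_win / lifetime_carbon_t:.0%} of total")
```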
And so I do believe this is a problem that as much as hardware is going to work on it, I am most excited to see the software teams getting excited about it and working on it because that's where we're really going to move the needle.
Allyson
I'm interested in this efficient code. The last conversation I had on cloud was with Abby Kearns, former CTO of Puppet, and we were talking about the state of app stack automation and the complexity that we've created in the cloud with the number of workloads and the depth of the stack. Do you see us making progress in efficiency, just from the standpoint of the cloud stack and the actual allocation of workloads?
Rebecca
So the biggest factor I've seen in increasing the efficiency of a server is containerization, right? Virtualization, containerization, just upping the number of tenants on a system, given that multicore architectures exist. I don't think we've hit some tipping point of complexity with respect to containerization or virtualization. Organizations who provide packaged services in this domain, and open source projects in this domain, are incredibly successful, and I tend to think about problems with the Pareto rule of 80/20. The average single-tenant server is usually about 10% occupied, and one that is supporting containerization or even virtualization has the opportunity to be closer to 45% to 65% load. So a huge improvement in reducing that 90% operational number is just to adopt containerization. It is complex; anybody who's operated a Kubernetes cluster knows it's a lot of work. I don't do that personally, but I know the gentleman who helps run the team. It's a lot of work to do distributed systems at scale and make sure that they stay up and are consistent and all of those factors. It is a hard problem, but I don't think the easier, lower-hanging-fruit parts of it are unsolved; there's really good technology happening in that domain. The specifics of writing more efficient code, though, are one of those areas where we're going to have to start with a mindset shift for developers. There's this great book called Nudge, and it talks about how, when you show people data, they start to make changes, whereas enforcing choices doesn't work very well.
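The utilization numbers above translate directly into fleet size and power. A quick sketch, with an assumed per-server core count and power draw, comparing a 10%-utilized single-tenant fleet to a containerized one in the middle of that 45% to 65% range:

```python
# A sketch of the utilization argument: the same workload on single-tenant
# servers at ~10% utilization vs a containerized fleet at ~55%. Core counts
# and power figures are assumptions for illustration.
workload_cores_needed = 10_000          # sustained busy cores the fleet must serve
cores_per_server = 64

def servers_required(avg_utilization: float) -> int:
    usable_cores = int(cores_per_server * avg_utilization)
    return -(-workload_cores_needed // usable_cores)  # ceiling division

single_tenant = servers_required(0.10)   # ~10% occupied, per the discussion
containerized = servers_required(0.55)   # midpoint of the 45-65% range

watts_per_server = 450                   # assumed average draw
print(f"Single-tenant servers: {single_tenant}, "
      f"~{single_tenant * watts_per_server / 1000:.0f} kW")
print(f"Containerized servers: {containerized}, "
      f"~{containerized * watts_per_server / 1000:.0f} kW")
```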
I think one of the most important things we can do as engineering management, as leaders in the industry, is to show people data about the consumption of their processes, to show people data about the carbon footprint of the choices they're making, for consumers, by the way, as well as developers. Like, if I am using a service and it told me, oh my gosh, you're watching this in high def and you are taking ten times the computing power, and therefore ten times the emissions footprint, as if you were watching it in standard def, I might choose to go to standard. It might work better anyway, since I'm probably on my cell phone on a treadmill. So it's actually not a bad thing to give consumers those choices. And similarly, I think we should give developers better tools. There are some great tools out there. Arjan van de Ven wrote PowerTOP and contributed it into the open source ecosystem, and there are a bunch of really fantastic tools that the community is starting to put out there. If we build those into our development pipelines and ensure that our developers can see that data, people's own desire to do better for the world will help. I mean, that is a positively Pollyanna, totally Rebecca statement, but it is so true in my view of the world. Most people are good, and most of us want to make the right choices for each other. So I think there's a nudge worth making toward exposing that data. And I think this is a challenge, because a lot of cloud providers want to have a magical experience and don't want you to have to think at all about the hardware underneath. Trust me, if a customer has to talk to Rebecca, something went really wrong. Really wrong. But there are options to expose the impact of choices, and I think you're starting to see that. Maybe you'd be willing to spend a little bit more on that cloud instance to know that the power source behind it is 100% green and that it's running in a facility with 100% water recycling. I think maybe people can be a little bit better if we start to show them, instead of just taking all those decisions away.
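One concrete way to put that data in front of developers, on Linux at least, is the RAPL powercap interface that tools like PowerTOP build on. A minimal sketch, assuming an Intel or AMD CPU that exposes /sys/class/powercap/intel-rapl:0 and permission to read it:

```python
# A minimal sketch of "showing developers the data": read package energy
# from the Linux RAPL powercap interface before and after a chunk of work.
# Assumes /sys/class/powercap/intel-rapl:0 exists and is readable; this is
# the same counter family that tools like PowerTOP build on.
RAPL = "/sys/class/powercap/intel-rapl:0"

def read_energy_uj() -> int:
    with open(f"{RAPL}/energy_uj") as f:
        return int(f.read())

def max_energy_uj() -> int:
    with open(f"{RAPL}/max_energy_range_uj") as f:
        return int(f.read())

start = read_energy_uj()
busy = sum(i * i for i in range(5_000_000))  # stand-in for real work
end = read_energy_uj()

delta = end - start
if delta < 0:                 # the counter wraps at max_energy_range_uj
    delta += max_energy_uj()
print(f"Package energy for this snippet: {delta / 1e6:.2f} J")
```

Wired into a CI pipeline, a reading like this is exactly the kind of per-change "nudge" described above: developers see joules next to test results instead of never seeing energy at all.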
Allyson
I'm going to go to that small part of the population that is not good for a second. Cloudflare published a paper this week on DDoS attacks, and reading it brought home that this is an area evolving by the minute: a race between an industry trying to protect data and customers, and bad actors looking to exploit situations. Where do you think we're at? And what role do infrastructure and a root of trust have to play in terms of security?
Rebecca
There are so many answers. Okay, where do I think we are in the world? You're absolutely right. We published some great work at the beginning of the conflict between Russia and Ukraine about some of the ways in which the infrastructure itself can show you changes, not just in DDoS situations, right, but even just changes in upstreaming and data content and data access, and what that potentially means. The state of the Internet is that it is a globally distributed system, and the physical path your data takes therefore goes through lots of places you may not feel comfortable with. I think governments are taking actions to try and create regulatory environments that keep user data more protected. And obviously, companies have a responsibility to take action, either by leveraging services that control and support a secure access service edge, or through companies like us who work to have fully encrypted servers, to make sure that we have a disaggregated root of trust, and that we are signing all of our certificates. There are a lot of best practices in security, and nothing is 100% secure. I think the goal is to layer so many parts of that cake that you are not the easy pickings out there in the world, and to recognize that this is a very complicated problem because of the nature of the Internet. We should use our brains, question things, be smart consumers, and think through whether we have really created a situation here that's logical.
Maybe that's a different podcast and a different conversation, Allyson, but in general, absolutely, hardware has a huge burden in this. There are a lot of solutions coming out to make that faster. But even if you don't have it in the hardware, there are options, and have been options in software for a long time, in terms of network security, whether it's a VPN or some form of SASE intercept. Whether it's hardened by hardware or not, even just a software set of good practices is 100% better. My favorite case of this in the news in the last six months is two-factor authentication through a true FIDO security key. How many companies got exploited because they thought two-factor auth was good enough? No, really, it is a lot harder to spoof a hardened security key, and I don't care how many times you have something text you a different code. Yes, that's better defense in depth, but it is not as good as having a hardened security key. And we continue to see this; it's like good, better, best. We start with software, we start with encryption, we start with a layered model. We start with a zero-trust model, where people have to ensure that they are compliant with the user they claim to be and the behavior patterns of users like them, with those access patterns. And we spend a lot of time on that. I in no way want to downplay the importance of the incredible services built 100% in software that help ensure people are actually following best security practices. But as we layer in hardened security behind that, it becomes harder to spoof. I won't say impossible to spoof; everyone who's read the CVEs out there knows that nothing's impossible. But it makes it that much harder to break the encryption, to break the security model, to break the key schema. And I think that's what we're all trying to do. A disaggregated root of trust is not because people haven't had some sort of a root-of-trust concept, but if that concept has been commingled with your BMC, you're in a situation where, if the BMC is violated, and go read the CVE database, unfortunately this is not an uncommon situation, you've been trusting an entity that is not trustworthy. So that's really where a disaggregated root of trust in your attestation chain, as you are trying to make sure your keys are accurate, makes sense. But there's no one panacea. I see it every day, going through and trying to reduce the issues every single day. This is an area where we will constantly need to be innovating, and we will constantly need to work as a community to create better solutions.
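For readers who haven't seen one, a measurement-and-attestation chain is simple to sketch. The toy example below mimics the PCR-style extend operation a root of trust performs at each boot stage; it is a simplification for illustration, not any vendor's actual API, and the stage names are invented:

```python
# A toy illustration of an attestation chain: each boot stage's code is
# measured and folded into a running digest (the "extend" operation a TPM
# PCR or discrete root of trust performs). Simplified; not any real API.
import hashlib

def extend(register: bytes, measurement: bytes) -> bytes:
    """PCR-style extend: new = SHA-256(old || SHA-256(measurement))."""
    return hashlib.sha256(register + hashlib.sha256(measurement).digest()).digest()

boot_stages = [b"rom-bootloader-v1", b"firmware-v7", b"bmc-image-v3", b"os-kernel-v5"]

register = bytes(32)  # starts at zero, like a PCR at reset
for stage in boot_stages:
    register = extend(register, stage)

golden = register  # value the verifier expects for a known-good boot

# If any single stage is tampered with, the final value diverges and
# attestation fails, which is why the measuring entity itself must be
# trustworthy and kept separate from what it measures (e.g. the BMC).
tampered = bytes(32)
for stage in [b"rom-bootloader-v1", b"firmware-EVIL", b"bmc-image-v3", b"os-kernel-v5"]:
    tampered = extend(tampered, stage)

print("matches golden:", tampered == golden)  # False
```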
And I think it is an incredible time, because all the biggest players in the industry are working on this, and most of them are actually driving standards into open systems through the Open Compute Project. Major projects were announced at our last global summit, Hydra being one implementation, and obviously there's everything that has happened with OpenTitan. All of the work happening in that domain for servers specifically is really exciting for the industry. And I think it really shows that security is a differentiator, but with system-level security, you're only as good as your weakest link. So if we don't bring up the whole industry, we're in trouble. And that's a huge amount of leadership from Google, from Microsoft, stepping up to say, hey, we're only as strong as our weakest link; let's make sure the industry is better.
Allyson
That's fantastic. One final question for you. We're heading into 2023. What are the exciting things that we can expect from the Cloudflare team and what are you most excited about to see from the industry next year?
Rebecca
Oh, gosh. Well, it's something that we talked about a little bit today: in this Cloudflare world, there is so much data and insight from running a global network, specifically targeting things like reducing DDoS attacks and making sure that the Internet is more secure. I look forward to seeing our teams take the mic and talk a little bit more about threat intelligence and all the different ways in which we can help consumers be smarter about it. That won't be my team at all, but it matters for the sake of the world, so that people can understand more. I thought the papers and work that we did around Ukraine were incredibly powerful, and I really look forward to seeing the team expand that work, because it's some of the stuff that inspires me most every day in building a better Internet.
For my team, we are building our next generation of modular server, which is super exciting, again, given all the e-waste conversation we had earlier. So that's been a lot of the work. We are also working actively on white box switches and solutions, both to have more inspection capability by using best server design techniques in the networking domain, and to have the network enable us to build what we want to build. So I'm really excited about that.
I mean, I'm totally geeking out about hardware stuff. The accelerator ecosystem continues to evolve and continues to be interesting. So I have at least three different varieties of ASICs, actually with at least two vendors for most of them, in the lab right now that we're starting to experiment with: to improve the accuracy of time pools for running a global network, to increase our efficacy in serving machine learning and analytics, so many different domains. And obviously one of our newest services launched this year was R2. As R2 continues to scale, going from being a globally distributed network to being a computational network that actually delivers object storage on top of it, there's so much transformation happening in our footprint, in our builds, in durability and latency requirements, and in end-user services. So it's been an incredible learning journey with that team to date, and I am just ecstatic to continue to build that and make it better, stronger, faster for our users.
Allyson
That's fantastic. Thank you so much for being on the program today, Rebecca. It's always a great chat.
Rebecca
Thank you for having me.