Speed, scale and reliability: 25 years of Google datacenter networking evolution (cloud.google.com)
290 points by sandwichsphinx 12 days ago | 78 comments





This mentions the Jupiter generations; Jupiter itself is about 10-15 years old at this point, I think. It doesn't really talk about what existed before, so it's not really 25 years of history here. I want to say "Watchtower" came before Jupiter? But honestly it's been about a decade since I read anything about it.

Google's DC networking is interesting because of how deeply integrated it is into the entire software stack. Click on some of the links and you'll see it mentions SDN (Software-Defined Networking). This is so Borg instances can talk to each other within the same service at high throughput and low latency. 8-10 years ago this was (IIRC) 40Gbps connections. It's probably 100Gbps now, but that's just a guess.

But the networking is also integrated into global services like traffic management to handle, say, DDoS attacks.

Anyway, from reading this it doesn't sound like Google is abandoning their custom TPU silicon (i.e., it talks about the upcoming A3 Ultra and Trillium). So where does Nvidia ConnectX fit in? AFAICT that's just the NIC they're plugging into Jupiter. That's probably what enables (or will enable) 100Gbps connections between servers. Yes, 100GbE optical NICs have existed for a long time; I would assume Nvidia produces better ones in terms of price, performance, size, power usage and/or heat produced.

Disclaimer: Xoogler. I didn't work in networking though.


The past few years there has been a weird situation where Google and AWS have had worse GPUs than smaller providers like CoreWeave and Lambda Labs. This is because they didn't want to buy into Nvidia's proprietary InfiniBand stack for GPU-GPU networking, and instead wanted to make it work on top of their own (still pretty proprietary) Ethernet stacks.

The outcome was really bad GPU-GPU latency and bandwidth between machines. My understanding is that ConnectX is Nvidia's supported (and probably still very profitable) way for these hyperscalers to use their proprietary networks without buying InfiniBand switches and without paying the latency cost of moving bytes from the GPU to the CPU.


Your understanding is correct. Part of the other issue is that at one point there was a huge shortage of IB switches (lead times of a year or more), so another solution had to be found.

RoCE (RDMA over Converged Ethernet) is essentially IB over Ethernet. All the underlying documentation and settings to put this stuff together are the same. It doesn't require ConnectX NICs, though. We do the same with 8x Broadcom Thor 2 NICs (into a Broadcom Tomahawk 5-based Dell Z9864F switch) for our own 400G cluster.


Nvidia got ConnectX from their Mellanox acquisition -- they were experts in RDMA, particularly over InfiniBand but eventually pushing Ethernet (RoCE) as well. These NICs have hardware acceleration of RDMA. Over the RDMA fabric, GPUs can communicate with each other without much CPU usage (the "GPU-to-GPU" mentioned in the article).

[I know nothing about Jupiter, and little about RDMA in practice, but I've used ConnectX for VMA, its hardware-accelerated kernel-bypass tech.]
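
Roughly, here's what that GPU-to-GPU traffic looks like from the application side -- just an illustrative sketch, assuming PyTorch with the NCCL backend (none of this is from the article). NCCL rides the RDMA fabric (InfiniBand or RoCE) when one is present, and with GPUDirect RDMA the payload doesn't have to be staged through host CPU memory:

    # Hypothetical minimal example: assumes PyTorch + NCCL, one GPU per process,
    # launched with torchrun (which sets rank/world size/master address).
    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # Each rank contributes a tensor; the all-reduce sums them across all GPUs.
    # With GPUDirect RDMA the transfers go NIC <-> GPU memory directly; without
    # it, NCCL stages through host buffers.
    x = torch.ones(1 << 20, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: element value = {x[0].item()}")

    dist.destroy_process_group()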


From memory: Firehose > Watchtower > WCC > SCC > Jupiter v1

This latest revision of Jupiter is apparently 400G, as is the ConnectX-7; A3 Ultra will have eight of them!

I would guess the Nvidia ConnectX is part of a secondary networking plane, not plugged into Jupiter. Current-gen Google NICs are custom hardware with a _lot_ of Google-specific functionality, such as running the borglet on the NIC to free up all CPU cores for guests.

Like most discussions of the last 25 years, this one starts 9 years ago. Good times.

The Further Resources section goes a bit further back.

It seems all cutting edge datacenters like x.ai Colossus are using Nvidia networking. Now Google is upgrading to Nvidia networking, too.

Since Nvidia owns most of the GPGPU market and has top-notch networking and interconnect, I wonder if they don't have a plan to own all datacenter hardware in the future. Maybe they plan to also release CPUs, motherboards, storage and whatever else is needed.


I read this slightly differently, that specific machine types with Nvidia GPU hardware also have Nvidia networking for tying together those GPUs.

Google has its own TPUs and doesn't really use GPUs except to sell them to end customers on Cloud, I think. So using Nvidia networking for Nvidia GPUs across many machines on Cloud is really just a reflection of what external customers want to buy.

Disclaimer: I work at Google but have no non-public info about this.


Having just worked with some of the Thread folks at M&S, thought I'd reach out and say hello. Seems like it was an awesome team! (=

You're lucky to be working with them, an amazing team.

Nvidia networking is what used to be called Mellanox networking, which was already dominant in datacenters.

Only within supercomputers (including the smaller GPU ones used to train AI). Normal data centers use Cisco or Juniper or similarly well-known Ethernet equipment, and they still do. The Mellanox/Nvidia InfiniBand networks are specifically used for supercomputer-like clusters.

Mellanox Ethernet NICs got used in a bunch of places due to better programmability.

Mellanox IB was ubiquitous in storage networking. None of the storage systems I worked on would have been possible without Mellanox tech.

Most people on the TrueNAS/FreeNAS forums use 10Gb Mellanox NICs that are sold on eBay. The sellers get them after server gear gets retired.

You seem to have a narrow definition of "normal" for datacenters. Meta was using OCP Mellanox NICs for common hardware platforms a decade ago and still is.

I have to wonder if Nvidia has reached a point where it hesitates to develop new products because they would hurt its margins. Sure, they could probably release a profitable networking product, but if they did, their net margins would decrease even as profit increased. This may actually hurt their market cap, as investors absolutely love high margins.

They can always release capital back to investors, and then those investors can put the money into different companies that eg produce networking equipment.

Why would they release money if they can invest it and return much more?

I was working under HDThoreaun's assumption that the margins would be lower.

If they have other opportunities for investment with higher margins, they should seize those, of course. And perhaps even call up investors for more capital, if required.


Returning capital would harm the agents employed by investors: so much of those people's compensation is in the form of stock, and returning capital leads to a decreasing stock value. So why would those agents ever return the capital voluntarily?

Buybacks increase stock price. Nvidia has a tiny dividend but most shareholder returns are in the form of buybacks.

Have you heard of stock buybacks?

I believe this is what they plan on doing. See, for example:

https://www.youtube.com/live/Y2F8yisiS6E?si=GbyzzIG8w-mtS7s-...


Grace Hopper already includes Arm-based CPUs (and reference motherboards).

Yeah, there’s a bit of industry worry about that very eventuality — hence the Ultra Ethernet Consortium trying to work on open alternatives to the Mellanox/Nvidia lock-in.

https://ultraethernet.org/


Interesting that Nvidia is on the steering committee.

Cisco has sat on the steering committees for a lot of things where they had a proprietary initial version of something. It's not that unusual, and also, it's often frankly not actually that open; e.g., see the rent-seeking racket for access to PCI documentation, or the USB-IF actively seeking to prevent open-source hardware from existing, etc.

Eh, the UEC effort is a standards org through the Linux Foundation, so it won't be subject to any of the usual chicanery. And actually, it looks like Nvidia is just a general member and not one of the Steering Committee members:

https://ultraethernet.org/wp-content/uploads/sites/20/2023/0...


That was their plan with trying to buy ARM...

Pretty crazy. Supporting a 1.5 Mbps video call for each human on Earth? Did I read that right?

Just goes to show how drastic and extraordinary these levels of scale can be.


Scale means different things to different people

Wow, and it doesn't open with a picture of their Lego server? Wasn't that their first one, 25 years ago?

It's a marketing piece, they don't particularly want to emphasize the hacky early days for an audience of Serious Enterprise Customers.

They managed to double from 6 Pbps in 2022 to 13 Pbps in 2023. I assume with ConnectX-8 this could be 26 Pbps in 2025/26. The ConnectX-8 is PCIe 6.0, so I assume we could get a 1.6 Tbps ConnectX-9 with PCIe 7.0, which is not far away.
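
Back-of-the-envelope for that last part, using public PCIe link rates rather than anything from the article (raw per-direction figures, overhead ignored): a x16 PCIe 6.0 slot is roughly 1 Tbps, enough for an 800G NIC, and PCIe 7.0 doubles that to roughly 2 Tbps, which is what would make a 1.6 Tbps NIC plausible.

    # Rough sketch of the extrapolation; all figures approximate, raw, per direction.
    lane_rates = {"PCIe 5.0": 32, "PCIe 6.0": 64, "PCIe 7.0": 128}  # ~Gbit/s per lane

    for gen, gbps_per_lane in lane_rates.items():
        x16_gbps = gbps_per_lane * 16
        print(f"{gen}: x16 ~ {x16_gbps} Gbps (~{x16_gbps // 8} GB/s) per direction")

    # PCIe 6.0 x16 (~1 Tbps) comfortably feeds an 800G NIC;
    # PCIe 7.0 x16 (~2 Tbps) is what a hypothetical 1.6 Tbps NIC would need.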

Can't wait to see the FreeBSD Netflix version of that post.

This also goes back to how increasing throughput is relatively easy and has a very strong roadmap, while increasing storage is difficult. I notice YouTube has been serving higher-bitrate video in recent years with H.264, instead of storing yet another copy of video files in VP9 or AV1 unless they are 2K+.


Does GCP have the worst networking for GPU training, though?

For TPU pods they use a 3D torus topology with multi-terabit cross-connects. For GPUs, A3 Ultra instances offer "non-blocking 3.2 Tbps per server of GPU-to-GPU traffic over RoCE".
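
For illustration only (not Google's code, and the sizes are made up): in a 3D torus every chip has exactly six direct neighbours, with wrap-around links on each axis.

    # Hypothetical helper: the six neighbours of a chip at (x, y, z)
    # in a torus of size (dx, dy, dz), with wrap-around on every axis.
    def torus_neighbors(node, dims):
        x, y, z = node
        dx, dy, dz = dims
        return [
            ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
            (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
            (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),
        ]

    # e.g. a corner chip in a (made-up) 4x4x4 slice wraps around on every axis:
    print(torus_neighbors((0, 0, 0), (4, 4, 4)))
    # [(1, 0, 0), (3, 0, 0), (0, 1, 0), (0, 3, 0), (0, 0, 1), (0, 0, 3)]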

Is that the worst for training? Namely: do superior solutions exist?


[flagged]


Just like Apple is doomed…for the past 20 years.

A friend of mine bet me $1000 in 2004 that Apple would be out of business in 5 years. Best bet I ever made, but if only I’d put my winnings in AAPL!

Speed, scale and reliability

Choose any two.


In any decision-making matrix you need a constraint that gets consumed (economics, size, etc.) to force a "choose any two" type situation.

You absolutely can have speed, scale and reliability. You can't have speed, scale, reliability and low cost.


Which of those is Google's network missing?

The most amazing surveillance machine ever …

Awesome, Google... Now learn what an availability zone is and stop creating zones that are just firewalled-off sections of the same data center.

Oh, and make your data centers smaller. Not so big they can be seen on Google Maps. Because otherwise you will be unable to move those whale-sized workloads to an alternative.

https://youtu.be/mDNHK-SzXEM?t=564

https://news.ycombinator.com/item?id=35713001

"Unmasking Google Cloud: How to Determine if a Region Supports Physical Zone Separation" - https://cagataygurturk.medium.com/unmasking-google-cloud-how...


Making a datacenter not visible on Google Maps, at least in most big cities where Google zones are deployed, would mean making it smaller than a car. Or even smaller than a dishwasher.

If I check London (where europe-west2 is kinda located) on Google Maps right now, I can easily discern manhole covers or people. If I check Jakarta (asia-southeast2), things smaller than a car get confusing, but you can definitely see them.


Your comment does not address the essence of the point I was trying to make. If you have one monstrous data center instead of many relatively smaller ones, you are putting too many eggs in one giant basket.

What if you have dozens of big data centers?

To reinforce your point:

The scale of cloud data centres reflects the scale of their customer base, not the size of the basket for each individual customer.

Larger data centres actually improve availability through several mechanisms: more power components such as generators mean the failure of any one costs just a few percent of capacity instead of causing a total blackout. You can also partition core infrastructure like routers and power rails into more fault domains and update domains.

Some large clouds have two update domains and five fault domains on top of three zones that are more than 10 km apart. You can’t beat ~30 individual partitions with your own data centres at a reasonable cost!
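
(Just to make the arithmetic explicit, with made-up names: 3 zones x 5 fault domains x 2 update domains gives the ~30 partitions above.)

    # Tiny sketch of the partition count; the names are hypothetical.
    from itertools import product

    zones = ["zone-a", "zone-b", "zone-c"]            # >10 km apart
    fault_domains = [f"fd{i}" for i in range(1, 6)]   # independent power/network
    update_domains = ["ud1", "ud2"]                   # rolled out separately

    partitions = list(product(zones, fault_domains, update_domains))
    print(len(partitions))   # 30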


I provided three different references. Despite the massive downvotes on my comment (by Google engineers, I guess, treating it as a troll... :-) ), I take comfort in the fact that nobody was able to offer a reference proving me wrong.

An AWS Availability Zone is sort-of-roughly-kinda a GCP region. It sounds like you want multi-region: https://cloud.google.com/compute/docs/regions-zones

> It sounds like you want multi-region

If you use Google Cloud... with the 100 ms of latency that will add to every interaction...


You haven't actually made an argument.

It is true that the nomenclature "AWS Availability Zone" has a different meaning than "GCP Zone" when discussing the physical separation between zones within the same region.

It's unclear why this is inherently a bad thing, as long as the same overall level of reliability is achieved.


The phrase "as long as the same overall level of reliability is achieved" is logically flawed when discussing physically co-located vs. geographically separated infrastructure.

Justify that claim.

In my experience, the set of issues that would affect two buildings close to each other but not two buildings a mile apart is vanishingly small: usually just last-mile fiber cuts or power issues (which are rare and mitigated by having multiple independent providers), as well as issues like building fires (which are exceedingly rare; we know of perhaps two of notable impact in more than a decade across the big three cloud providers).

Everything else is done at the zone level no matter what (onsite repair work, rollouts, upgrades, control plane changes, etc.) or can impact an entire region (non-last-mile fiber or power cuts, inclement weather, regional power starvation, etc.).

There is a potential gain from physical zone isolation, but it protects against a relatively small set of issues. Is it really better to invest in that, or to invest the resources in other safety improvements?


I think you're underestimating the seriousness of a physical event like a fire. Even if the likelihood of these things is "vanishingly small", the impact is so large that it more than offsets it. Taking the OVH data center fire as an example, multiple companies completely lost their data and are effectively dead now. When you're talking about a company-ending event, many people would consider even just two examples per decade a completely unacceptable failure rate. And it's more than just fires: we're also talking about tornadoes, floods, hurricanes, terrorist attacks, etc.

Google even recognizes this, and suggests that for disaster recovery planning, you should use multiple regions. AWS on the other hand does acknowledge some use cases for multiple regions (mostly performance or data sovereignty), but maintains the stance that if your only concern is DR, then a single region should be enough for the vast majority of workloads.

There's more to the story though, of course. GCP makes it easier to use multiple regions, including things like dual-region storage buckets, or just making more regions available for use. For example GCP has ~3 times as many regions in the US as AWS does (although each region is comparatively smaller). I'm not sure if there's consensus on which is the "right" way to do it. They both have pros and cons.


What happened in the GCP Paris region, then?

One of the vanishingly small set of issues I mentioned.

It is true, and obvious, that GCP and AWS and Azure use different architectures. It does not obviously follow that any of those architectures are inherently more reliable. And even if it did, it doesn't obviously follow that any of the platforms are inherently more reliable due to a specific architectural decision.

Like, all cloud providers still have regional outages.


I think you should have started this discussion by disclosing you work at Google...

> One of the vanishingly small set of issues

At your scale, this attitude is even more concerning since the rare event at scale is not rare anymore.


I think you're abusing the saying "at scale, rare events aren't rare" (https://longform.asmartbear.com/scale-rare/ etc.) here. It is true that when you are running thousands of machines, events that happen rarely happen often, but that scale usually becomes relevant at thousands, or hundreds of thousands, or millions of things (https://www.backblaze.com/cloud-storage/resources/hard-drive...).

That concept is useful when the number of things you have is on the same order of magnitude as the inverse of the failure rate. But we clearly don't have that here, because even at scale, these events aren't common. Like I said, there have been fewer than a handful across all cloud providers over a decade.

Like, you seem to be proclaiming that these kinds of events are common and, well, no, they aren't. That's why they make the top of HN when they do happen.


To address the availability point of your comment: Google's terminology is slightly different from AWS's.

On GCP it sounds like you want to have a multi region architecture, not multi-zone (if you want firewalls outside the same data center).

> Resources that live in a zone, such as virtual machine instances or zonal persistent disks, are referred to as zonal resources. Other resources, like static external IP addresses, are regional. Regional resources can be used by any resource in that region, regardless of zone, while zonal resources can only be used by other resources in the same zone.

https://cloud.google.com/compute/docs/regions-zones

(No affiliation with Google, just had a similar confusion at one point)
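
(Side note, a hedged sketch rather than anything from the parent comment: the google-cloud-compute Python client can list a project's zones and show how they group into regions; the project ID below is a placeholder.)

    # Hypothetical snippet: assumes `pip install google-cloud-compute` and
    # application-default credentials are configured.
    from google.cloud import compute_v1

    client = compute_v1.ZonesClient()
    for zone in client.list(project="my-project-id"):      # placeholder project ID
        # zone.region is a full URL; keep just the trailing region name.
        print(zone.name, "->", zone.region.rsplit("/", 1)[-1])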


You also need to go multi-region with AWS. I liked their AZ story, but in practice it hasn't avoided multi-zone outages (maybe caused by deploys?).

[flagged]


This isn't even close to true. You can just go on Google Maps and visually see the literally *hundreds* of wholly owned and custom-built data centers from AWS, MS, and Google. Edge locations (like Cloud CDN) are often in colos, but the main regions' compute/storage are not. Most of them are even labeled on Google Maps.

Here are a couple of search terms you can just type into Google Maps to see a small fraction of what I mean:

- "Google Data Center Berkeley County"

- "Microsoft Data Center Boydton"

- "GXO council bluffs" (two locations will appear, both are GCP data centers)

- "Google Data Center - Henderson"

- "Microsoft - DB5 Datacentre" (this one is in Dublin, and is huuuuuge)

- "Meta Datacenter Clonee"

- "Google Data Center (New Albany)" (just to the east of this one is a massive Meta data center campus, and to the immediate east of it is a Microsoft data center campus under construction)

And that's just a small sample. There are hundreds of these sites across the US. You're somewhat right that a lot of international locations are colocated in places like Equinix data centers, but even then it's not all of them and it varies by country (for example, in Dublin they mostly all have their own buildings, not colo). If you know where to look and what the buildings look like, the custom-built and self-owned data centers from the big cloud providers are easy to spot, since they all have their own custom design.


While the OP is more wrong than right, they aren't completely incorrect.

I'm in Australia.

GCP has 2 regions in Australia, Sydney and Melbourne. The Sydney region is in an Equinix DC. Not sure where the Melbourne one is, but it isn't a Google-owned facility.

You can see this by comparing Google's Data Center list: https://www.google.com/about/datacenters/locations/ vs their Cloud Location list https://cloud.google.com/about/locations#asia-pacific

Note that the Cloud locations aren't just "edge": they offer hosting, GPUs, etc. at these locations.


Thank you for your words of support @nl.

I agree with you that @anewplace is clearly taking a very US/North America-centric view of the world, arrogantly claiming they know everything and telling me I'm some idiot.

It's very telling that @anewplace has gone quiet and isn't lecturing you in a condescending manner about how you must have "misunderstood".


[flagged]


Yea, you're not the only "insider" here. And you're 100% wrong. Just because you completely misunderstand what those Amazon/MS employees are doing in those buildings doesn't mean that you know what you're talking about.

The big cloud players have the vast majority of their compute and storage hosted out of their own custom-built and self-owned data centers. The stuff you see in colos is just the edge locations like CloudFront and Cloud CDN, or the new-ish offerings like AWS Local Zones (which are a mix between self-owned and colo, depending on how large the local zone is).

Most of this is publicly available by just reading sites like datacenterdynamics.com regularly, btw. No insider knowledge needed.


This doesn't seem right for GCP.

Compare https://cloud.google.com/about/locations vs https://www.google.com/about/datacenters/locations/

The Cloud locations aren't just edge locations (scroll down on that page and note most have all APIs supported) and there are a lot more of them than there are Google-owned DCs.


[flagged]


Well those people lied to you then, or more likely there was a misunderstanding, because you can literally just look up the sites I mentioned above and see that you're entirely incorrect.

You don't need to be under NDA to see the hundreds of billions of dollars worth of custom built and self-owned data centers that the big players have.

Hell, you can literally just look at their public websites: https://www.google.com/about/datacenters/locations/

I am one of those "pay grades many layers higher", and I can personally confirm that each of the locations above is wholly owned and used by Google, and only Google, which already invalidates your claim that "you can count the wholly-owned sites on one hand". Again, this isn't secret info, so I have no issue sharing it.


I assume they are referring to PoPs, or other locations where large providers house frontends and other smaller resources.

Ignore him.

He's got a tinfoil hat on and won't be persuaded.

> Because I'm not relying on what one person or one company told me, my facts have been diligently and discretely cross-checked.

"Discretely cross-checked" already tells me he chooses to live in his own reality.


[flagged]


I'm not trying to make you divulge anything. I don't particularly care who you talk to, or who you are, nor do I care if you take it as a "personal insult" that you might be wrong.

You are right that it would be nuts that multiple senior people would collude to lie to you, which is why it's almost certainly more likely that you are just misunderstanding the information that was provided to you. It's possible to prove that you are incorrect based on publicly available data from multiple different sources. You can keep being stubborn if you want, but that won't make any of your statements correct.

You didn't ask for my advice, but I'll give it anyway: try to be more open to the possibility that you're wrong, especially when evidence that you're wrong is right in front of you. End of story.


You know, it is very much region-dependent.

You are correct that many facilities are owned by the hyperscalers, and they also extensively use colos for hosting entire regions (not only PoPs), especially outside the US. More recently I’d also include Ireland.

I have worked at two cloud providers, very close to the netops teams due to my customers, but I have signed NDAs so I won’t go further into it, especially since one of my ex-employers is very touchy about this subject.


I see anewplace hasn't come down on you (@rescbr) like a ton of bricks like he did on me.

You are basically saying the same thing I did, but with different words.

So anewplace owes me a big-time apology.


> try to be more open to the possibility that you're wrong,

Same to you chum, same to you.

Just because it doesn't look that way from your view of the world, doesn't mean you are right either.

Perhaps just accept we are both right and that you are missing aspects of my context that I cannot talk about due to the sensitive nature of it.


It can be true that all the big clouds/cdns/websites are in all the big colos and that big tech also has many owned and operated sites elsewhere.

As one of these big companies, you've got to be in the big colos because that's where you interconnect and peer. You don't want to have a full datacenter installation at one of those places if you can avoid it, because costs are high; but building your own has a long timetable, so it makes sense to put things into colos from time to time, and of course, things get entrenched.

I've seen datacenter lists when I worked at Yahoo and Facebook, and it was a mix of small installations at PoPs, larger installations at commercial colo facilities, and owned-and-operated data centers. Usually new large installations were owned and operated, but it took a long time to move out of commercial colos too. And then there are also whole-building leases, from companies that specialize in that. Outside the US there was more likelihood of being in commercial colo, I think because of logistics, but at large system counts the dollar efficiency of running it yourself becomes more appealing (assuming land, electricity, and fiber are available).


It is true that every cloud provider uses some edge/colo infra, but it is not true that most (or really any significant amount of) processing happens in those colo/edge locations.

Google lists their dc locations publicly: https://www.google.com/about/datacenters/locations/

AWS doesn't list the campuses as publicly, but https://aws.amazon.com/about-aws/global-infrastructure/regio... shows the AZ vs. edge deployments, and any situation with multiple AZs is going to have whole buildings, not just floors, operated by Amazon.

And limiting to just outside the US, both AWS and Google have more than ten wholly owned campuses each, and then on top of that there is edge/colo space.



