How are supercomputer clusters networked together?

roid · Post by **roid** » Fri Jun 06, 2014 11:43 pm

This is just a thought experiment.

Say if i wanted to build a GPGPU cluster, basically a sub-$10,000 GPU based supercomputer distributed across 2-5 PCs, each PC having 4x SLI-linked video-cards chosen solely for their cuda-cores/$ price point.
(I want as many GPU cores as possible, coz it'll theoretically be used for faster-than-real-time processing of 3D pointclouds from RGBD cameras. My back of the envelope calculations suggest i could probably get 24,000 cores* for $6000.)

OK, the question is: What would be the preferred way to connect the PCs together?

I'd initially think some ordinary networking thing like 10-gigabit ethernet, but i don't really need that sort of robustness do i? The PCs are literally side by side and communicate only with each other, so i could probably reduce overheads by using something much simpler than TCP/IP.
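
For a rough sense of how much of the link TCP/IP headers actually consume, here is a minimal sketch. It assumes standard Ethernet with a 1500-byte MTU and IPv4, TCP and UDP headers without options; the function name `wire_efficiency` is made up for illustration. It only counts bytes on the wire and says nothing about the CPU time spent in the protocol stack.

```python
# Per-frame byte costs on standard Ethernet (assumed: 1500-byte MTU, no VLAN tag).
PREAMBLE_SFD = 8      # preamble plus start-of-frame delimiter
ETH_HEADER = 14       # destination MAC, source MAC, EtherType
ETH_FCS = 4           # frame check sequence
INTERFRAME_GAP = 12   # mandatory idle time between frames, counted in bytes
MTU = 1500            # largest IP packet carried in one frame

IPV4_HEADER = 20      # without options
TCP_HEADER = 20       # without options
UDP_HEADER = 8


def wire_efficiency(transport_header: int) -> float:
    """Fraction of on-wire bytes that carry application payload."""
    payload = MTU - IPV4_HEADER - transport_header
    on_wire = PREAMBLE_SFD + ETH_HEADER + MTU + ETH_FCS + INTERFRAME_GAP
    return payload / on_wire


if __name__ == "__main__":
    print(f"TCP: {wire_efficiency(TCP_HEADER):.1%} of the link is payload")
    print(f"UDP: {wire_efficiency(UDP_HEADER):.1%} of the link is payload")
```

Under these assumptions the headers take roughly 5% of the link either way, so any gain from a simpler protocol would have to come from CPU work and latency rather than raw bandwidth.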

And why communicate serially? The parallel lines of the PCIE bus are right there staring me in the face, is there an existing and (maybe common/cheap) networking tech that can more directly make use of those lovely lovely parallel lines?
Or maybe you could use the SATA bus? That seems fast.
Hmm, or even better: couldn't you just get the cards to output their data over the video cables themselves? Aren't video cables basically the highest bandwidth cables around? Oh, but i probably won't have an easy way of RECEIVING that data from a video cable, scratch that idea (right?).

So yeah, networking super-computer clusters. How's it done?

*Yes, this video was indeed the initial seed for this post.

Krom · Post by **Krom** » Sat Jun 07, 2014 7:31 am

Supercomputer clusters are often wired together with plain old Ethernet and TCP/IP. Sure, the switches are usually bigger and faster (especially for larger clusters) than what you would be using at home, but the concept is pretty much the same. Keep in mind that supercomputer clusters are not particularly bothered by latency: they can crunch numbers incredibly fast with so many nodes, but the projects they work on are not even close to realtime processing.

Your problem would be the realtime requirement; most distributed computing is NOT realtime. It is less a battle against bandwidth (though bandwidth definitely helps) and much more a battle against latency. You also can't expect a quad SLI GPU node with teraflops of compute to be kept even remotely busy by a simple dual core CPU, especially when you are talking about realtime processing. Additionally, the machine that coordinates the nodes probably can't be much of a slouch either; a millisecond here and there stacks quite unfavorably with this kind of demand.

The bandwidth requirements might not be that much, and the compute power is also easy to come by, but the latency requirement you are asking for is a major undertaking. Also, you can forget running Windows to do this: Windows is NOT a realtime OS (and neither is any common distribution of Linux, for that matter). Sure, these days when my system is running Descent 3 at 1000+ FPS it is pretty close to realtime, but for a more modern game in the 50-70 fps range the latency of the system is easily around 10-20 ms, completely unacceptable for what you want to do.
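
The frame-rate figures above translate into per-frame time budgets by simple division. A minimal sketch (the function name `frame_time_ms` is made up for illustration, and it only covers the time between frames, not any extra buffering in the pipeline):

```python
def frame_time_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    for fps in (50, 70, 1000):
        print(f"{fps:>5} fps -> {frame_time_ms(fps):.1f} ms per frame")
```

At 50-70 fps each frame alone takes about 14-20 ms, while 1000 fps leaves 1 ms per frame.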

From your video, it may not even be that it requires that much compute to do that, it may just be that it requires that much compute to do it in <1 millisecond.

roid · Post by **roid** » Mon Jun 09, 2014 5:20 am

hmm yeah you're right, bandwidth may not be an issue. From the point of view of building a SLAM map from the incoming RGBD data, the GPUs will likely be quite happy doing all the computation internally once they've received each frame, or part-frame since i imagine the data can be split up quite well. Or perhaps each GPU would be allocated a single frame, integrating it into the whole on its own time, there being enough GPUs in the mix for no frames to be skipped. The resulting map would thus be as up-to-date as the time it takes to process a single frame (plus overheads). I never knew there were such semantics around the phrase "real-time".
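
The one-frame-per-GPU idea can be sketched as a round-robin schedule. This is only an illustration: the 30 fps camera rate and the 120 ms processing time are assumed numbers, and `gpus_needed` and `assign_round_robin` are hypothetical helper names, not part of any library.

```python
import math


def gpus_needed(process_ms: float, fps: float) -> int:
    """Smallest GPU count so that a GPU is always free when the next frame arrives.

    A GPU is busy for process_ms per frame, and frames arrive every 1000/fps ms,
    so process_ms * fps / 1000 frames are in flight at any moment.
    """
    return max(1, math.ceil(process_ms * fps / 1000.0))


def assign_round_robin(num_frames: int, num_gpus: int) -> list:
    """GPU index that handles each frame, cycling through the GPUs in order."""
    return [frame % num_gpus for frame in range(num_frames)]


if __name__ == "__main__":
    n = gpus_needed(process_ms=120.0, fps=30.0)  # assumed numbers
    print(f"GPUs needed: {n}")
    print(f"Frame -> GPU: {assign_round_robin(8, n)}")
```

Adding GPUs this way raises throughput so no frame is dropped, but each result still arrives one full processing time (plus overheads) after its frame.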

It aches my head pondering how distributed computation on large datasets and/or streams works. Yet it's apparently done. *shrug*
I'm not sure in the video how the data is split up for distributed processing.

Regarding this bit:

Krom wrote: You also can't expect a quad SLI GPU node with teraflops of compute to be kept even remotely busy by a simple dual core CPU

You wouldn't be putting a dual core in a quad SLI system, but yeah point taken - the more data you need to move, the more CPU power required. Sounds like a good excuse to use a low-bandwidth high-computation unit-based method of distributed computation, like the various @home projects opt for. Perhaps this is how it's done in the video.

One of the reasons i mentioned simplifying (or near-eliminating) the normal networking protocols would be in hope of reducing the CPU's workload, perhaps eliminating the need for the data to sit in ram while it's divvied into packets (if that's how it works).
It's surprising there's apparently never been a need to extend the SLI bus past 4 nodes, it seems perfect for networking GPGPU PCs together. Can't they just SLI 3 cards together on each PC, and then connect the 4th card to a separate INTER-PC SLI chain which links a single card in each of the 4 PCs? The data travelling from each PC to the next would still need to come through each PC's PCIe bus to get into the other 3 GPUs, but still - it seems like an improvement over network cards and serial networking protocol stacks.
So 4 PCs with 5 SLI chains between them. Get on it nVidia, heh.
(oh wait, you could have 5x GPUs in each PC, since there's now an extra slot available in the original SLI chain)
This is terrible, like you said: i feel sorry for that poor CPU trying to shunt the data from one SLI network to another across the PCIe bus. Why can't we have a longer SLI bus arghasdkagskjh >:[ Or a secondary SLI link on cards, that'd work too. I guess the problem is that there's no use for this other than supercomputing, so it's impossible to justify adding to consumer cards.

roid · Post by **roid** » Mon Jun 09, 2014 6:20 am

https://en.wikipedia.org/wiki/AMD_CrossFireX#Current_generation_.28XDMA.29 wrote: ...Instead, they use XDMA to open a direct channel of communication between the multiple GPUs in a system, operating over the same PCI Express bus...

Interesting! Seems to be suggesting that the CPU doesn't even need to be involved (or at least, isn't a limiting factor) for devices to communicate directly to one another over the PCIE bus.
Not nVidia, but still... glad someone's on it.

Jeff250 · Post by **Jeff250** » Mon Jun 09, 2014 10:18 am

Aside from Ethernet, InfiniBand is also commonly used to network supercomputers. It supports cool stuff like remote DMA, which is probably about as good as you could hope for when it comes to quickly sharing data between computers.

snoopy · Post by **snoopy** » Mon Jun 09, 2014 1:33 pm

You have an interesting idea.

Genuine "real time" processing is a different world... FPGAs and more direct access into the processors become the norm, as opposed to OSes with schedulers.

As Krom points out: at that point, the challenge of real time parallel processing is splitting the input up and splicing the outputs back together quickly and efficiently.
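
The split-and-splice pattern can be sketched on a single machine. In this minimal example threads stand in for cluster nodes (an assumption made purely to keep it self-contained), and the per-chunk job is a simple centroid calculation rather than real point cloud processing; `split`, `partial_sum` and `centroid` are names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split(points, parts):
    """Split a list into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(points), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(points[start:end])
        start = end
    return chunks


def partial_sum(chunk):
    """Per-worker step: coordinate sums and point count for one chunk."""
    sx = sum(p[0] for p in chunk)
    sy = sum(p[1] for p in chunk)
    sz = sum(p[2] for p in chunk)
    return (sx, sy, sz, len(chunk))


def centroid(points, workers=4):
    """Split the cloud, process chunks in parallel, then splice the results."""
    if not points:
        raise ValueError("points must not be empty")
    chunks = split(points, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(p[3] for p in partials)
    return tuple(sum(p[axis] for p in partials) / total for axis in range(3))


if __name__ == "__main__":
    cloud = [(0, 0, 0), (2, 0, 0), (0, 4, 0), (2, 4, 8)]
    print(centroid(cloud, workers=3))
```

The merge step here is cheap because each worker returns only four numbers; a job whose partial results are large would spend correspondingly more time on the splice.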

roid · Post by **roid** » Fri Jun 13, 2014 6:24 pm

yeah when i initially said "real-time" i just meant that it could process the entire stream as it came in, without discarding any packets or needing to save the stream for later off-line processing. Basically just "before the next data packet comes in". Point cloud data takes a fair bit of processing.

Jeff250 thanks for the mention of InfiniBand, pretty fast, but it seems like just another serial networking standard. I guess you gotta go serial (with routing protocols and all that overhead) if you're running a switched network eh. i imagine a non-switched network standard would be more likely to be parallel.
