This is pretty awe-inspiring but as a programmer I know it would be fairly difficult to use this machine for existing workloads because so much code would have to be rewritten from typical x86 code to CUDA/OpenCL to use all those GPUs.
Personally, I'm more excited for the next wave of supercomputers built with racks of Xeon Phis [1].
(full disclosure: used to work for NV on CUDA and did very extensive work on Titan, so I am probably biased)
If you think your existing MPI app is going to automatically scale to a heterogeneous architecture (high-power x86 on the main CPU, Xeon Phi cores on the accelerator) and get acceptable performance, sorry, it's not going to happen.
The fundamental constraints on 2012/2013 Xeon Phi performance that determine how apps should be written are exactly the same as current desktop GPUs (small, high-latency local memory that is not coherent with the rest of the system; relatively slow, high-latency link to CPU; ugly interactions with network cards in most environments; fundamental need to hide memory latency at all times). For any sort of performance beyond a standard Xeon, you're going to want to run a Xeon Phi as a targeted accelerator rather than offloading entire processes to it and using a standard MPI stack. This means you're going to be running in a hybrid host/device mode and using compiler directives or a specific parallel language and API to deal with on-chip execution and data transfer, which puts you in exactly the same solution space as with GPUs.
In other words: the Phi of today is not a panacea. You get better tools and more flexibility in the programming model, but the fast path that any of its intended market would use in applications looks identical to GPUs.
GPUs are SIMD machines: all of the active cores execute the same instruction simultaneously. That means if your code branches, the hardware has to mask out the cores that follow branch B while it executes branch A, then mask out the cores that follow branch A while it executes branch B. In other words, if at least one core follows each side of the branch, it has to execute both branches.
If all cores branch in the same direction, you don't get that penalty. A large part of optimising for the GPU comes down to arranging your data and code so that this can happen.
GPUs suck at any problem that cannot be easily divided. If you can map a function over arbitrary chunks of self-contained data, GPUs will perform better.
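The "map a pure function over self-contained chunks" shape looks something like the following sketch (using Python's standard thread pool as a stand-in for GPU cores; the chunk size and `kernel` are arbitrary choices for illustration):

```python
# Embarrassingly parallel shape: a pure function mapped over
# independent chunks, with no communication between chunks.
from concurrent.futures import ThreadPoolExecutor

def kernel(chunk):
    # Each result depends only on its own input,
    # so chunks can run in any order, on any worker.
    return [x * x for x in chunk]

data = list(range(16))
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(kernel, chunks))

flat = [x for chunk in results for x in chunk]
```

Anything where one chunk's result depends on another's (pointer chasing, irregular graphs, data-dependent branching) loses this property and maps poorly onto a GPU.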
In fairness, I don't think there are any existing workloads that would benefit 299k cores that aren't already massively parallel :) If you need this kinda thing, your code is already going to be ready for it.
I'm with you on the Phis though. Can't wait for one to be affordable for a home machine
Does anyone know why they have a separate disk IO system when they could more easily just plug drives into each node/motherboard for higher aggregate throughput, less complexity, and a lower overall cost?
EDIT: Blade systems or no, the drives have to physically be placed somewhere. Having a separate subsystem can only take up more space, not less. Two reasons I can think of: (1) independent scaling of compute and storage, and (2) lack of software for a distributed filesystem. Most likely (2) plus inertia is the real reason; all the others seem like rationalizations. For example, they are either able to take nodes offline or they aren't. The need exists whether or not the disks are attached there.
- Jitter. Others have mentioned it, but I'll repeat it. These systems run carefully tuned microkernels to try to avoid any spurious interrupts, placing disk storage directly in the system would cause a lot of problems. Most applications require a careful lock-step progression to keep the calculations relevant; having one node take even 1% longer on a step wastes tremendous amounts of compute power for the system as a whole.
- System lifecycle management. As mentioned in the article, supercomputers are only cost effective to run for ~ 4 years; but data storage can outlast that. De-coupling storage and compute makes the transitions a bit easier; you can still get at the old filesystem even when the last generation machine has been scrapped. Also, this helps with access from related-but-disparate systems - you do need to get the data in and out of the machine to do anything useful with it, and potentially interrupting or degrading compute performance for external file access would be a problem.
- Power, and power stability - filesystems, especially large distributed systems such as Lustre/GPFS/PVFS2, do NOT handle power loss well. Best current practice for HPC centers is to keep the storage subsystems, file servers and disk arrays on backup power, but the compute side is directly run from the grid. Embedding disk in the compute would either require UPS'ing the compute platform, or anticipating filesystem corruption.
As with pretty much everything in the industry, people are looking at approaches to solve these. There is some inertia as you speculated, but it's becoming obvious that I/O is the new bottleneck as systems scale up any further, and you'll likely see some storage start moving in closer to the compute system.
The common programming model of parallel supercomputers expects something like one shared filesystem across all nodes (used for input, output, checkpointing, and such, almost never as temporary per-node storage). And the most readily available and scalable implementations of that expect exactly this architecture, as it is easier to implement in software, easier to reason about performance-wise, and also significantly cheaper and easier to maintain.
A distributed filesystem in the GFS sense (assuming you mean Red Hat's GFS2), like GFS itself, SGI's CXFS, Apple's Xsan, or even the clustering support in almost all enterprise DBMSs, does not mean that the data itself is distributed. Only the filesystem (or DB) logic is distributed; the sole shared component is a single central (e.g. FC- or iSCSI-attached) dumb block device with no special understanding of the filesystem, which is the exact opposite of using the nodes' local hard drives.
Things like Lustre or GoogleFS use distributed storage nodes that are normally separate from client nodes.
While in both cases nothing prevents you from running both the client applications and the storage server process on the same node, there is little reason to do so, for both operational reasons (the disks are all in one place) and economic ones (attaching 20 disks to one node is cheaper than attaching 20 disks to 20 nodes).
In fact, a few years ago I actively searched for a filesystem with built-in redundancy and storage capacity distributed across the client nodes, and found almost nothing designed for such a use case.
The only thing I'm aware of that came close to what you propose is/was essentially a wrapper for NFS in (open)MOSIX, which could redirect a process's I/O to the local disk when it found the process had been migrated to the same node the NFS server runs on. With the important note that (open)MOSIX is not exactly a meaningful HPC solution, but it's interesting nevertheless.
Distributed filesystems are designed with commodity hardware and traditional data-centers in mind. Why SHOULD it be distributed across local drives on all nodes? These are computationally intensive workloads (like supercomputers), not storage I/O intensive workloads (like mainframes or Hadoop clusters). If you're managing hardware on this scale and you don't need local drives, why on earth wouldn't you put your storage in a separate, central location?
> Does anyone know why they have a separate disk IO system
It would be very interesting if they could describe all their general design decisions, such as this
> ... when they could more easily just plug drives into each node/motherboard for higher aggregate throughput, less complexity, and a lower overall cost?
Doesn't the fact that they haven't put the drives on the compute nodes make you question your claim that it would have been 'easier' and 'better' and 'cheaper'?
There are a couple of (related) reasons. The first thing to recognize is that systems such as this one are designed for parallel workloads where all processes are running in lockstep, communicating via MPI with frequent barriers. This is very different from MapReduce and other asynchronous or "embarrassingly parallel" workloads where GFS, HDFS, etc. tend to be used. Distributed filesystems used in high-performance computing (such as Lustre, IBM's GPFS, etc.) also have to be able to handle both reads and writes with high throughput, whereas GFS is mostly optimized for reads and appends.
Why not just install disks in the compute nodes and run Lustre there? Since all the nodes are working together in lockstep, system jitter is a major problem. Imagine that you have a job running across 10,000 nodes and 160,000 cores, and a process on one of those cores gets preempted for a millisecond while a disk I/O request is being serviced. Everyone waits, and you've suddenly wasted 160 core-seconds. Now, if this happens only 1000 times per second across the whole machine, it's clear that you're not going to make much forward progress, and the whole system is going to run at very low efficiency. For this reason, Crays and similar large machines run a very minimal OS on the compute nodes (a Linux-based "compute node kernel" in the case of Cray). Introducing local disks would go against the whole philosophy.
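The arithmetic behind those numbers is worth spelling out. A crude back-of-the-envelope, assuming a barrier after every step and (pessimistically) that stalls never overlap:

```python
# Jitter cost model for a lockstep job, using the numbers above.
cores = 160_000
stall = 0.001    # one core preempted for 1 ms

# A barrier makes every core wait for the slowest one, so a 1 ms
# stall on one core costs 1 ms on ALL cores.
wasted_core_seconds = cores * stall    # 160.0

# If independent 1 ms stalls hit 1000 times per second somewhere
# on the machine, the stalls alone eat the entire capacity
# (an upper bound, since real stalls can overlap in time).
stalls_per_second = 1000
wasted_per_second = wasted_core_seconds * stalls_per_second
fraction_lost = wasted_per_second / cores    # -> 1.0, i.e. 100%
```

Even if stalls overlap and the true loss is a fraction of this bound, the scaling is the point: per-node noise is multiplied by the whole machine.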
There's also the issue of network contention. The network is typically the bottleneck, and you want to minimize the extent to which file I/O competes with your MPI traffic.
As someone else mentioned, the solution is to have a dedicated storage system (often Lustre running on a semi-segregated cluster). This approach is used almost universally by the 500 systems on the Top 500 list (http://top500.org), for example. It's not just inertia :-).
Disk I/O has negligible CPU overhead. Preempted for a millisecond? A millisecond is millions of instructions; you're off by orders of magnitude. No matter where the disk is located, the disk I/O has to go across the network. If network capacity truly is the bottleneck, you have a different design problem and you can't exploit the CPUs.
EDIT: I still don't buy it, but I will give some thought to the synchronous/lockstep nature of the environment.
That was a straw man (sorry). There's also the overhead of maintaining consistency, synchronizing metadata, etc. I don't think assuming 0.1% CPU overhead for Lustre is a terrible estimate, but even if it were much lower, the argument would still hold (at least at the scale of Titan).
My guess is that most of the work they do on this machine won't be bottlenecked on the HDDs, so they don't worry about it too much. And it's easier to replace hard drives in a separate rack than to take blades offline to do it.
[1] - http://www.intel.com/content/www/us/en/high-performance-comp...