Big heterogeneous computer systems, especially forthcoming exascale computers, are power hungry and difficult to program effectively. The problem is widely recognized. In a recent blog post, Charmworks’ CEO Sanjay Kale makes the case for smart runtime systems generally, and Charm++ specifically, as necessary tools for taming power consumption without giving up performance.
Power consumption, wrote Kale, “[is] such a large concern that the Department of Energy Science for the Future Act that was passed by the House of Representatives in July explicitly calls for an Energy Efficient Computing Program to be established by the DOE. The bill describes it as ‘a program of fundamental research, development, and demonstration of energy efficient computing technologies relevant to advanced computing applications in high performance computing, artificial intelligence, and scientific machine learning,’ and it says there should be partnerships among the national labs, industry, and higher-ed to co-design energy efficient hardware, software, and applications.”
Charm++, you may know, is one of several existing tools targeting large heterogeneous HPC systems. “With Charm++ and related software systems, you can optimize performance by way of continuous introspection – constantly and automatically assessing the performance of the computation and changing or reconfiguring it to improve that performance,” says Kale.
Here’s a summary description: “Charm++ is a parallel object-oriented programming paradigm based on C++ and developed in the Parallel Programming Laboratory at the University of Illinois at Urbana–Champaign. Charm++ is designed with the goal of enhancing programmer productivity by providing a high-level abstraction of a parallel program while at the same time delivering good performance on a wide variety of underlying hardware platforms. Programs written in Charm++ are decomposed into a number of cooperating message-driven objects called chares. When a programmer invokes a method on an object, the Charm++ runtime system sends a message to the invoked object, which may reside on the local processor or on a remote processor in a parallel computation. This message triggers the execution of code within the chare to handle the message asynchronously.”[i] (Short primer on Charm++)
While focused on Charm++, Kale’s blog describes techniques shared by several programming-model-and-runtime approaches for managing power consumption and executing effectively on large heterogeneous computers. Below is an excerpt from Kale’s ‘Q&A’ blog with a link to the full blog at the end.
What attributes of its programming model allow Charm++ to reconfigure the application while it is running?
Charm++ has three main attributes. The first is overdecomposition, in which the programmer divides an application’s computation requirements into many relatively small objects, each representing a coarse work and/or data unit. The number of such objects greatly exceeds the number of processors. The second attribute is migratability, which is the ability to move these objects among processors. This means the user addresses their communication (i.e., messages) to the logical objects, rather than to physical processors. This gives the runtime system the ability to move these objects across nodes and processors as it sees fit. The third attribute is message-driven execution, which allows the system to select which objects will run next based on the availability of messages. These three attributes enable Charm++ to provide many useful features including dynamic load balancing, fault tolerance, and job malleability.
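To make the three attributes concrete, here is a minimal Python sketch. It is a toy model and not the Charm++ API: the names Chare, ToyRuntime, send, migrate, and run are hypothetical, and the "processors" are just in-memory queues. It shows many objects mapped onto few processors, messages addressed to logical object IDs rather than processors, and objects that run only when a message is waiting for them.

```python
from collections import deque


class Chare:
    """Hypothetical stand-in for a migratable object (not the Charm++ API)."""

    def __init__(self, obj_id):
        self.obj_id = obj_id
        self.received = []

    def handle(self, payload):
        self.received.append(payload)


class ToyRuntime:
    def __init__(self, num_procs, num_objects):
        # Overdecomposition: many more objects than processors.
        self.objects = {i: Chare(i) for i in range(num_objects)}
        # Location table: logical object ID -> physical processor.
        self.location = {i: i % num_procs for i in range(num_objects)}
        # One message queue per processor.
        self.queues = [deque() for _ in range(num_procs)]

    def send(self, obj_id, payload):
        # Senders name the logical object; the runtime resolves the processor.
        self.queues[self.location[obj_id]].append((obj_id, payload))

    def migrate(self, obj_id, new_proc):
        # Migratability: move the object and forward its pending messages.
        old_proc = self.location[obj_id]
        if old_proc == new_proc:
            return
        pending = [m for m in self.queues[old_proc] if m[0] == obj_id]
        self.queues[old_proc] = deque(
            m for m in self.queues[old_proc] if m[0] != obj_id
        )
        self.queues[new_proc].extend(pending)
        self.location[obj_id] = new_proc

    def run(self):
        # Message-driven execution: an object runs only when a message
        # addressed to it is available on its processor's queue.
        log = []
        while any(self.queues):
            for proc, queue in enumerate(self.queues):
                if queue:
                    obj_id, payload = queue.popleft()
                    self.objects[obj_id].handle(payload)
                    log.append((proc, obj_id, payload))
        return log


rt = ToyRuntime(num_procs=2, num_objects=8)
rt.send(5, "a")    # object 5 starts on processor 1
rt.migrate(5, 0)   # the runtime moves it to processor 0
rt.send(5, "b")    # the sender's code is unchanged after the move
log = rt.run()
```

Because the sender only ever names object 5, the move from processor 1 to processor 0 is invisible to application code. That indirection is what leaves the runtime free to relocate work.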
How does that help with runtime adaptation?
Take load balancing, for example. Any time the division of work among processors is not uniform, you have a load imbalance. If one processor takes longer than the others to complete its part, all others are held up as they wait to synchronize. This waiting leads to inefficiency and saps performance. In some applications, the imbalance shifts and can grow dramatically as the computation evolves.
Automatically assess and address that imbalance and you’re using power more efficiently, getting results faster, and/or improving the resolution of your simulation in the same amount of time.
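A little arithmetic shows why one slow processor costs so much. The sketch below assumes a simple model in which every processor synchronizes at the end of each step, so the step takes as long as the most heavily loaded processor. The function names and the load numbers are hypothetical and chosen only for illustration.

```python
def step_time(loads):
    # With a synchronization at the end of each step, everyone waits
    # for the slowest processor.
    return max(loads)


def efficiency(loads):
    # Useful work divided by the processor-time actually consumed.
    return sum(loads) / (len(loads) * max(loads))


def idle_time(loads):
    # Total processor-time spent waiting at the synchronization point.
    return sum(max(loads) - load for load in loads)


balanced = [12, 12, 12, 12]     # hypothetical per-processor work units
imbalanced = [10, 10, 10, 18]   # same total work, unevenly divided

print(step_time(balanced), efficiency(balanced), idle_time(balanced))
print(step_time(imbalanced), efficiency(imbalanced), idle_time(imbalanced))
```

Both cases contain the same 48 units of work, but in this model the imbalanced step takes 18 time units instead of 12, and a third of the consumed processor-time is spent waiting.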
Charm++ relies on the principle-of-persistence heuristic. This principle states that, for overdecomposed iterative applications, the task’s or object’s computation load and communication pattern tend to persist over time. The heuristic uses the application’s load statistics collected periodically by the runtime system, which provides an automatic, application-independent way of obtaining load statistics without any user input. If desired, the user can specify predicted loads and thus override system predictions. Using the collected load statistics, Charm++ periodically executes a load-balancing strategy to determine a better objects-to-processors mapping and then migrates objects to their new homes accordingly. Its suite of load balancers includes several centralized, distributed, and hierarchical strategies. Charm++ can also automate the decision of when to call the load balancer, as well as which strategy to use, based on observed application characteristics using a machine-learning model.
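The measure-then-remap cycle can be sketched in a few lines of Python. This is an illustrative greedy heuristic (heaviest object first, onto the currently least-loaded processor) and is not presented as the strategy Charm++ uses. The function names, the load values, and the starting round-robin placement are all assumptions. The sketch relies on the principle of persistence by treating loads measured in recent iterations as a prediction of upcoming loads.

```python
import heapq


def greedy_rebalance(measured_loads, num_procs):
    """Map objects to processors from measured loads.

    measured_loads: dict of object ID -> load observed in recent iterations,
    used as the prediction for upcoming iterations (principle of persistence).
    Returns a dict of object ID -> processor.
    """
    heap = [(0.0, proc) for proc in range(num_procs)]
    heapq.heapify(heap)
    mapping = {}
    # Place the heaviest objects first, each on the least-loaded processor.
    by_weight = sorted(measured_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for obj_id, load in by_weight:
        proc_load, proc = heapq.heappop(heap)
        mapping[obj_id] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    return mapping


def processor_loads(mapping, loads, num_procs):
    totals = [0.0] * num_procs
    for obj_id, proc in mapping.items():
        totals[proc] += loads[obj_id]
    return totals


# Hypothetical measurements for 8 objects on 2 processors.
loads = {0: 8, 1: 7, 2: 6, 3: 5, 4: 4, 5: 3, 6: 2, 7: 1}
old_mapping = {obj_id: obj_id % 2 for obj_id in loads}   # round-robin start
new_mapping = greedy_rebalance(loads, num_procs=2)

# Objects whose home changed are the ones the runtime would migrate.
to_migrate = sorted(o for o in loads if old_mapping[o] != new_mapping[o])

print(processor_loads(old_mapping, loads, 2))
print(processor_loads(new_mapping, loads, 2))
print(to_migrate)
```

A centralized strategy like this needs all load data in one place, which is one reason a runtime might also offer distributed and hierarchical alternatives at larger scale.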
But notice that this whole capability is possible only because the programming model supports overdecomposition and migratability.
Link to full Kale blog: https://www.hpccharm.com/post/the-power-of-charm-and-runtime-systems
[i] https://en.wikipedia.org/wiki/Charm%2B%2B