ScaleML is an ERC-funded research project that develops new abstractions, algorithms, and system support for scalable machine learning. As most machine learning computation is already parallel or distributed, increasing computational demands place immense pressure on algorithms and systems to scale, exposing the performance limits of current distributed computing paradigms. The question of how to build scalable machine learning algorithms and systems is therefore extremely pressing.
Approach. In a nutshell, the project's approach is elastic coordination: allowing machine learning algorithms to approximate and/or randomize their synchronization and communication semantics, in a structured, controlled fashion, in order to scale. The project exploits the insight that many machine learning algorithms are inherently stochastic, and hence robust to the inconsistencies that relaxed coordination introduces. The thesis is that elastic coordination can yield significant, consistent performance improvements across a wide range of applications, while still guaranteeing provably correct executions.
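To make the idea concrete: in shared memory, one well-known instance of relaxed synchronization is lock-free ("Hogwild"-style) parallel SGD, where threads update a shared model without any locking and the algorithm's stochasticity absorbs the resulting races. The sketch below is purely illustrative and not ScaleML's code; all names and parameters are made up for the example.

```python
# Illustrative sketch of relaxed shared-memory coordination:
# lock-free ("Hogwild"-style) parallel SGD on a least-squares problem.
import threading
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

w = np.zeros(d)  # shared model, updated by all threads WITHOUT locks

def worker(indices, lr=0.05, epochs=3):
    for _ in range(epochs):
        for i in indices:
            # Reads of w may be stale, and writes may race with other
            # threads; SGD's inherent stochasticity tolerates this.
            grad = (X[i] @ w - y[i]) * X[i]
            w[:] = w - lr * grad  # in-place update, no synchronization

parts = np.array_split(rng.permutation(n), 4)
threads = [threading.Thread(target=worker, args=(p,)) for p in parts]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("error:", np.linalg.norm(w - w_true))  # small despite the races
```

The same relaxation underlies asynchronous parameter-server training across machines; correctness arguments then typically bound the staleness of the reads.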
Applications. ScaleML applies elastic coordination in two key scenarios: scalability within a single multi-threaded machine, and scalability across networks of machines. Conceptually, the project contributes a set of new design principles and algorithms for scalable computation. Practically, it develops these insights into a set of usable tools and working examples for scalable distributed machine learning.
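In the cross-machine scenario, the analogous relaxation targets communication: workers may send lossy, randomized compressions of their gradients, provided the compression is unbiased so that SGD remains correct in expectation. Below is a hedged sketch of one such scheme, stochastic gradient quantization; the function name and setup are our own, not a ScaleML API.

```python
# Illustrative sketch of relaxed communication: unbiased stochastic
# quantization of a gradient vector before it is sent over the network.
import numpy as np

def quantize(v, rng):
    """Each coordinate becomes sign(v_i) * ||v|| with probability
    |v_i| / ||v||, and 0 otherwise, so E[quantize(v)] == v.
    The receiver then needs only ~1 bit per coordinate plus one shared
    float (the norm), instead of a full-precision float per coordinate."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    keep = rng.random(v.shape) < np.abs(v) / norm
    return norm * np.sign(v) * keep

rng = np.random.default_rng(1)
g = np.array([0.3, -0.4, 1.2])
# Averaging many quantized copies recovers g: the estimator is unbiased,
# so downstream SGD trades extra variance for far cheaper communication.
avg = np.mean([quantize(g, rng) for _ in range(50_000)], axis=0)
print(avg)  # close to g
```

The design choice here is the general one behind elastic coordination: inject controlled randomness (here, quantization noise) whose effect on a stochastic algorithm can be analyzed and bounded, in exchange for less coordination or communication.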