Adaptive Pricing and Coding for Dispersed Computing

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0053. Any views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

PI: Professor Salman Avestimehr
Co-PIs: Professor Bhaskar Krishnamachari and Professor Murali Annavaram
 

Project description

Recent years have witnessed rapid growth in computationally intensive applications running on users' devices in military operations. In most of these application domains, much of the needed data is collected at edge devices (e.g., sensors, mobile platforms, and users' computers). Conventional approaches channel this data through the network to back-end servers for computation. However, substantial computing capability is embedded along the entire path from the edge to the cloud. Exploiting these "dispersed computing capabilities" therefore benefits performance and power, and it is also essential in military operations, where uncertain bandwidth availability and decision-making latency are critical considerations. Beyond the traditional challenges of distributed computing over the cloud (e.g., resource allocation, task scheduling, and data distribution), dispersed computing faces three key challenges: limited and variable bandwidth, computing and network heterogeneity, and network dynamics. In this project, we combine theoretical advances with real system implementation to provide a comprehensive dispersed computing framework that tackles these three challenges with the following features:
  1. We develop an innovative "decentralized task scheduling framework and pricing-based algorithms" for resource allocation in dispersed computing. Unlike the centralized schedulers traditionally used in cloud and grid computing, the decentralized framework we propose will reduce the coordination overhead of constantly collecting fresh computation and communication state from all networked computation points (NCPs). This will make the dispersed computing system both more scalable and more responsive to network failures or degradation under attack. Our pricing-based approach to resource allocation will allow dynamic optimization of task assignments among competing jobs, supporting job prioritization and deadline satisfaction.
  2. We develop a novel "Coded Dispersed Computing (CDC)" framework, a new architecture that leverages available or under-utilized computing power across the network to create coding opportunities that can significantly reduce the bandwidth consumption and latency of dispersed computing. We also develop a new class of codes, named maximum robustness codes, that provide robustness to the maximum number of failing or straggling nodes. Compared to today's uncoded computing systems, CDC significantly improves bandwidth utilization and resilience to failures and stragglers.
  3. We design and deploy a "heterogeneous dispersed computing testbed at USC" that will emulate real-world scenarios in which video data is collected across distributed sensor nodes. The testbed will be a unique fusion of ultra-low-power NCPs integrated with sophisticated video and audio collection sensors. On the networking front, the testbed will support heterogeneous communication protocols. We develop a full suite of software capabilities that expose the underlying heterogeneous testbed capabilities in a unified manner, so that applications can exploit the testbed transparently. The in-house testbed will be complemented by a commercial dispersed computing testbed, albeit with limited heterogeneity, where we demonstrate our approaches at scale.
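The pricing-based scheduling idea in item 1 can be illustrated with a minimal sketch. The congestion-pricing rule (a node's price rising linearly with its utilization), the Node and Task classes, the link costs, and the per-unit "value" used to encode job priority are all hypothetical choices made for illustration, not the project's algorithm. The point it shows is that each task consults only the prices that nodes advertise, so no scheduler needs a fresh global view of system state.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical compute node (NCP) that advertises a price per unit of work."""
    name: str
    capacity: float          # work units the node can accept per scheduling round
    load: float = 0.0        # work units already committed this round
    base_price: float = 1.0  # illustrative constant, not a value from the project

    def price(self) -> float:
        # Assumed congestion pricing: the price grows linearly with utilization,
        # so busy nodes become less attractive without a central coordinator.
        return self.base_price * (1.0 + self.load / self.capacity)


@dataclass
class Task:
    name: str
    work: float   # work units required
    value: float  # hypothetical per-unit willingness to pay (encodes priority)


def assign(task: Task, nodes: list[Node], link_cost: dict[str, float]) -> str | None:
    """Place a task on the node with the lowest price plus link cost.

    Returns the chosen node's name, or None if no node has room or every
    offer exceeds the task's value (low-priority work waits for a later round).
    """
    best, best_cost = None, float("inf")
    for node in nodes:
        if node.load + task.work > node.capacity:
            continue  # not enough spare capacity this round
        cost = node.price() + link_cost.get(node.name, 0.0)
        if cost < best_cost:
            best, best_cost = node, cost
    if best is None or best_cost > task.value:
        return None
    best.load += task.work  # committing work raises this node's future price
    return best.name


nodes = [Node("edge", capacity=4.0), Node("cloud", capacity=10.0)]
link = {"edge": 0.1, "cloud": 1.0}  # assumed per-unit cost of reaching each node
print([assign(Task(f"t{i}", 2.0, 5.0), nodes, link) for i in range(3)])  # ['edge', 'edge', 'cloud']
```

In this toy run the nearby edge node is cheapest until its rising price and limited capacity push the third task to the cloud; a task whose value falls below every available offer is simply left unassigned, which is one simple way prices can enforce prioritization.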
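The straggler resilience described in item 2 can be illustrated with a standard (n, k) MDS-coded matrix-vector multiplication. This is a generic textbook construction swapped in for illustration, not the project's CDC framework or its maximum robustness codes; the function names encode and decode and the real-valued Vandermonde generator are assumptions. The matrix is split into k row blocks, n workers each receive one coded combination of those blocks, and the result A @ x is recovered from whichever k workers finish first.

```python
import numpy as np


def encode(A: np.ndarray, n: int, k: int):
    """Split A into k row blocks and build n coded blocks, one per worker."""
    if A.shape[0] % k:
        raise ValueError("number of rows must be divisible by k")
    blocks = np.split(A, k)
    # Real-valued Vandermonde generator with distinct points 1..n: any k of its
    # n rows form an invertible matrix, so any k workers suffice (MDS property).
    G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
    return G, coded


def decode(G: np.ndarray, results: dict, k: int) -> np.ndarray:
    """Recover A @ x from any k results of the form {worker: coded[worker] @ x}."""
    if len(results) < k:
        raise ValueError("need results from at least k workers")
    idx = sorted(results)[:k]
    R = np.stack([results[i] for i in idx])  # shape (k, rows per block)
    partial = np.linalg.solve(G[idx, :], R)  # row j equals A_j @ x
    return partial.reshape(-1)


A = np.arange(12.0).reshape(6, 2)
x = np.array([1.0, -1.0])
G, coded = encode(A, n=3, k=2)
# Worker 0 straggles; the master decodes from workers 1 and 2 only.
results = {i: coded[i] @ x for i in (1, 2)}
print(np.allclose(decode(G, results, k=2), A @ x))  # True
```

Real-valued Vandermonde matrices become ill-conditioned as n grows, so practical coded-computing systems typically use better-conditioned generators or finite-field arithmetic.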
Please see this news article for a high-level description of the project:  

Publications

 

Software

CIRCE is a runtime scheduling tool for dispersed computing. It deploys pipelined computations, described as a directed acyclic graph (DAG), on multiple geographically dispersed computers (compute nodes).

DRUPE collects information about the computational resources of compute nodes, and about the network links between them, from across a dispersed computing system and reports it to a central node. DRUPE consists of a network profiler and a resource profiler.

Another project uses in-network coding to trade extra computation for reduced communication load in a distributed sorting algorithm (TeraSort). The paper for this project is available at https://arxiv.org/abs/1702.04850.
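CIRCE's own interfaces are not described on this page. As a hypothetical sketch of the ordering constraint that any DAG deployment must respect, the following uses Python's standard graphlib module to group tasks into stages whose members have no mutual dependencies and can therefore run concurrently on different nodes. The task names, node names, and round-robin placement are placeholders; a real scheduler could instead choose nodes from measured resource and link data such as the profiling information DRUPE gathers.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def pipeline_stages(dag: dict[str, set[str]]) -> list[list[str]]:
    """Group DAG tasks into stages; tasks in one stage have no mutual dependencies.

    `dag` maps each task to the set of tasks whose outputs it consumes.
    """
    ts = TopologicalSorter(dag)
    ts.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while ts.is_active():
        ready = sorted(ts.get_ready())
        stages.append(ready)
        ts.done(*ready)
    return stages


def place_round_robin(stages: list[list[str]], nodes: list[str]) -> dict[str, str]:
    """Hypothetical placement: assign tasks to compute nodes in turn."""
    placement = {}
    for i, task in enumerate(t for stage in stages for t in stage):
        placement[task] = nodes[i % len(nodes)]
    return placement


dag = {"detect": {"camera"}, "resize": {"camera"}, "fuse": {"detect", "resize"}}
stages = pipeline_stages(dag)
print(stages)  # [['camera'], ['detect', 'resize'], ['fuse']]
print(place_round_robin(stages, ["pi-1", "pi-2"]))
# {'camera': 'pi-1', 'detect': 'pi-2', 'resize': 'pi-1', 'fuse': 'pi-2'}
```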
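The computation-for-bandwidth trade in the TeraSort work can be seen in a toy three-node shuffle. This sketch shows only the general coded-multicast idea (XOR-ing values so that each receiver can cancel the part it already holds) under a hypothetical data placement; it is not the algorithm of the paper linked above. Because node C computed both intermediate values during a redundant map phase, one multicast packet replaces two unicasts.

```python
def xor_bytes(u: bytes, v: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    if len(u) != len(v):
        raise ValueError("operands must have equal length (pad them in practice)")
    return bytes(p ^ q for p, q in zip(u, v))


# Hypothetical setup: node C holds both intermediate values;
# node A already has b and needs a, node B already has a and needs b.
a = b"keys-for-A"
b = b"keys-for-B"

# Uncoded shuffle: C unicasts a to A and b to B (two transmissions).
# Coded shuffle: C multicasts the single packet a XOR b to both nodes.
packet = xor_bytes(a, b)

decoded_at_A = xor_bytes(packet, b)  # A cancels its local copy of b
decoded_at_B = xor_bytes(packet, a)  # B cancels its local copy of a
print(decoded_at_A == a, decoded_at_B == b)  # True True
```

The extra work is the redundant computation that gives each receiver its side information; the saving is one transmission out of two in this example.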

Raspberry Pi cluster

A Raspberry Pi edge cloud will be used to deploy our software.