About the project

MapReduce is a project funded by the French National Research Agency (ANR), ARPEGE 2010 call. Project number: ANR-10-SEGI-001.

Map-Reduce is a parallel programming paradigm successfully used by large Internet service providers to perform computations on massive amounts of data. The key strength of the Map-Reduce model is its inherently high degree of potential parallelism: it enables processing petabytes of data in a couple of hours, on large clusters consisting of several thousand nodes.
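To make the paradigm concrete, here is a minimal sketch of a Map-Reduce word count in Python. It is an illustration only, not code from this project or from Hadoop: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a local thread pool stands in for the cluster nodes that a real framework would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Each chunk is mapped independently, so the map calls can run in parallel.
    # Assumption: threads stand in for the nodes of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    # Each key is reduced independently as well.
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    chunks = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(chunks))
```

Because no map call depends on another, and no key's reduction depends on another key, both phases can be spread over as many workers as there are chunks or keys. This independence is the source of the potential parallelism described above.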

The storage layer is a key component of Map-Reduce frameworks. To enable massively parallel data processing over a large number of nodes, the storage layer must meet a series of specific requirements: it is expected to provide efficient fine-grain access to the files while sustaining a high throughput under heavy access concurrency.

This project aims to overcome the limitations of current Map-Reduce frameworks such as Hadoop, thereby enabling highly scalable Map-Reduce-based data processing on various physical platforms such as clouds, desktop grids, or hybrid infrastructures that combine the two.

To meet this global goal, several critical aspects will be investigated:

Data storage and sharing architecture. First, we will explore advanced techniques for scalable, high-throughput, concurrency-optimized data and metadata management, based on recent preliminary contributions of the partners.

Scheduling. Second, we will investigate various scheduling issues related to large executions of Map-Reduce instances. In particular, we will study how the scheduler of the Hadoop implementation of Map-Reduce can scale over heterogeneous platforms; other issues include dynamic data replication and fair scheduling of multiple parallel jobs.

Fault tolerance and security. Finally, we intend to explore techniques to improve the execution of Map-Reduce applications on large-scale infrastructures with respect to fault tolerance and security.

Our global goal is to explore how combining these techniques can improve the behavior of Map-Reduce-based applications on the target large-scale infrastructures. To this end, we will rely on recent preliminary contributions of the project partners, illustrated through the following main building blocks:

BlobSeer, a new approach to distributed data management being designed by the KerData team from INRIA Rennes - Bretagne Atlantique to enable scalable, efficient, fine-grain access to massive, distributed data under heavy concurrency.

BitDew, a data-sharing platform currently being designed by the GRAAL team from INRIA Grenoble - Rhône-Alpes at ENS Lyon, with the goal of exploring the specificities of desktop grid infrastructures.

Nimbus, a reference open source cloud management toolkit developed at the University of Chicago and Argonne National Laboratory (USA) with the goal of facilitating the operation of clusters as Infrastructure-as-a-Service (IaaS) clouds.

Contact

Project coordinator:

Gabriel Antoniu

INRIA Rennes - Bretagne Atlantique

Campus de Beaulieu

35042 Rennes cedex

e-mail: gabriel(dot)antoniu(at)inria(dot)fr