especificación, diseño e implementación de un entorno de … · 2019-05-24 · programación...

Di r ecci ó n:Di r ecci ó n: Biblioteca Central Dr. Luis F. Leloir, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires. Intendente Güiraldes 2160 - C1428EGA - Tel. (++54 +11) 4789-9293

Co nta cto :Co nta cto : [email protected]

Tesis Doctoral

Especificación, diseño eEspecificación, diseño eimplementación de un entorno deimplementación de un entorno de

programación concurrente basado enprogramación concurrente basado enpatronespatrones

Pérez, Gervasio Daniel

2018

Este documento forma parte de la colección de tesis doctorales y de maestría de la BibliotecaCentral Dr. Luis Federico Leloir, disponible en digital.bl.fcen.uba.ar. Su utilización debe seracompañada por la cita bibliográfica con reconocimiento de la fuente.

This document is part of the doctoral theses collection of the Central Library Dr. Luis FedericoLeloir, available in digital.bl.fcen.uba.ar. It should be used accompanied by the correspondingcitation acknowledging the source.

Cita tipo APA:

Pérez, Gervasio Daniel. (2018). Especificación, diseño e implementación de un entorno deprogramación concurrente basado en patrones. Facultad de Ciencias Exactas y Naturales.Universidad de Buenos Aires. http://hdl.handle.net/20.500.12110/tesis_n6506_Perez

Cita tipo Chicago:

Pérez, Gervasio Daniel. "Especificación, diseño e implementación de un entorno deprogramación concurrente basado en patrones". Facultad de Ciencias Exactas y Naturales.Universidad de Buenos Aires. 2018. http://hdl.handle.net/20.500.12110/tesis_n6506_Perez

http://digital.bl.fcen.uba.ar

http://digital.bl.fcen.uba.ar

http://hdl.handle.net/20.500.12110/tesis_n6506_Perez

http://hdl.handle.net/20.500.12110/tesis_n6506_Perez

mailto:[email protected]

Universidad de Buenos AiresFacultad de Ciencias Exactas y Naturales

Departamento de Computación

Especificación, diseño e implementación de unentorno de programación concurrente basado en

patrones

Tesis presentada para optar al título de Doctor de la Universidad de BuenosAires en el área Ciencias de la Computación

Gervasio Daniel Pérez

Director de tesis: Dr. Sergio Fabián YovineConsejero de estudios: Dr. Victor BrabermanLugar de trabajo: Departamento de Computación

Facultad de Ciencias Exactas y NaturalesUniversidad de Buenos Aires

Buenos Aires, 2018Fecha de defensa: 18 de Abril de 2018

Especificación, diseño e implementación de un entorno deprogramación concurrente basado en patrones

Resumen: La tarea de desarrollar software paralelo que sea eficiente, lógi-camente correcto y económico desde un punto de vista de costo-beneficio escompleja. Diversos problemas potenciales pueden causar comportamientos in-correctos y performance pobre. El diseño de software basado en patrones puedeayudar a cubrir los requisitos de correctitud y de escalabilidad. Sin embargo,éste tiene diversos problemas: (a) muchos de los patrones no están disponibles enlos modelos y lenguajes actuales de programación paralela; (b) frecuentementees difícil implementar y utilizar el patrón más apropiado para el problema a re-solver; y (c) muchos patrones no proveen fácilmente la capacidad de componerseentre sí, dificultando el soporte para procesamiento paralelo sobre arquitecturasheterogéneas.

Con el objetivo de ayudar a solventar estas deficiencias, las contribucionesde esta tesis son tres. Primero, se presenta un nuevo patrón genérico de proce-samiento paralelo llamado PCR [43], que consiste en una combinación de elemen-tos productores, consumidores y reductores que operan de manera concurrentesobre un conjunto de datos de entrada. Se provee una definición matemáticade la semántica de dicho patrón en términos del formalismo FXML. Se muestraademás que los PCRs pueden componerse y comprenden otros patrones de progra-mación paralela conocidos, proveyendo así un entorno compatible con diseñosheterogéneos. Segundo, se muestra formalmente que el patrón PCR puede serimplementado de manera correcta en términos de un modelo de concurrenciamás concreto a modo de implementación. Tercero, se provee una librería multi-plataforma de plantillas C++ que permite expresar instancias concretas del patrónPCR. Se presenta un prototipo de compilador basado en reescritura de plantillasque genera de manera automática implementaciones paralelas y distribuídas dePCRs basándose en la librería de concurrencia Intel Concurrent Collections. Elentorno de programación y generación de código se ilustra con diversos casos deestudio.

En resumen, el entorno propuesto provee medios para mejorar la calidady productividad del desarrollo de software paralelo mediante una metodologíabasada en construcciones de programación de alto nivel e independientes de laplataforma y una infraestructura de compilación para generar código ejecutableportable.

Palabras clave: modelos formales de concurrencia, paralelismo es-tructurado, generación de código basada en transformaciones

Specification, design and implementation of apattern-based concurrent programming environment

Abstract: Developing correct and efficient parallel software in a cost-effec-tive way is challenging. There are a number of pitfalls that lead to incor-rect behaviors and poor performance. Pattern-based software design could helpachieving correctness and scalability. However, it has several drawbacks: (a)most patterns are not broadly supported by current parallel-programming mod-els and languages; (b) most often than not getting the appropriate pattern rightis difficult; and (c) most patterns do not compose easily, thus making it hard todeal with heterogeneous parallelism.

As an attempt to overcoming these issues, the contribution of this thesisis threefold. First, it proposes a parallel-programming pattern, called PCR [43],consisting of producers, consumers, and reducers which operate concurrentlyon data sets. To favor correctness, the semantics of PCRs is mathematicallydefined in terms of the formalism FXML. PCRs are shown to be composable andto seamlessly subsume other well-known parallel-programming patterns, thusproviding a framework for heterogeneous designs. Second, it formally showshow the PCR pattern can be correctly implemented in terms of a more concreteparallel execution model. Third, it proposes a platform-agnostic C++ templatelibrary to express PCRs. It briefly presents a prototype compiler based on C++

template re-writing which automatically generates distributed implementationsrelying on the Intel Concurrent Collections C++ library. The programming andcode-generation suite is illustrated through several case studies.

Overall, the proposed framework provides means to enhance parallel soft-ware quality and productivity through an automated methodology based onhigh-level, platform-independent programming constructs, and a compiling in-frastructure to generate portable, executable code.

Keywords: formal concurrency models, structured parallelism,transformation based code generation

Agradecimientos

A mis padres Daniel y Beatriz que, aún sin haber seguido estudios univer-sitarios, siempre nos apoyaron desde lo económico y afectivo en la continuaciónde nuestros estudios de grado. En particular a mi papá que ya no está entrenosotros.

A Miguel y Liliana y al resto de mi familia política por darme apoyo detodo tipo; en particular a Liliana por motivarme a seguir el camino académicoy ayudarme a dar los primeros pasos, tanto académicos como burocráticos.

A Silviana, Gabriel y a los integrantes de FIUBA con los cuales compartí ycomparto experiencias académicas y profesionales.

A todos los docentes de las diversas etapas de mi educación pública quemostraron vocación por la educación y me motivaron a seguir adelante. Enparticular a todo el plantel docente del DC-FCEyN-UBA gracias a los que améla carrera de Ciencias de la Computación, lo suficiente incluso para decidirejercer la docencia y embarcarme en un Doctorado en el mismo Departamento.También a todos mis compañeros de docencia a lo largo de ya 9 años en elDepartamento.

A Sergio que me guió en el Doctorado y fue esencial para que llegue a buenpuerto, mostrándome la dirección a seguir en los momentos de mas desori-entación.

A Hernán, Benoit y Nazareno, los jurados que aceptaron evaluar esta Tesis yproveerme del feedback para conseguir que este trabajo resulte el mejor posible.

A Edu, Juli, Mariano, Alvaro y otros amigos de la vida que me acompañarondurante todo el trayecto.

A Esteban, Ferto, Hernán, Dipi, Diego, Daniela, Guido, Mariano, Ezequiel,Rodrigo, Fernán, Victor, Sebastián y muchos otros amigos y compañeros deLaFHIS con los que compartí distintas etapas de mis estudios de Doctorado.

A Laura, mi compañera, sin la que nada de esto hubiera sido posible bajoningún concepto. Y por supuesto a Joaquín, que con los movimientos desdesu panza me recuerda que tengo por delante un proyecto infinitamente másambicioso que cualquier Doctorado.

v

Contents

1. Resumen en Español 31.1. Capítulo 2 - Introducción . . . . . . . . . . . . . . . . . . . . . . 3

1.1.1. Limitaciones de la programación secuencial . . . . . . . . 31.1.2. Desarrollo de software paralelo . . . . . . . . . . . . . . . 41.1.3. Contribuciones . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2. Capítulo 3 - Conceptos preliminares . . . . . . . . . . . . . . . . 51.2.1. Programación paralela basada en patrones . . . . . . . . . 51.2.2. Especificaciones paralelas de FXML . . . . . . . . . . . . 6

1.3. Capítulo 4 - El patrón Producir–Consumir–Reducir . . . . . . . . 61.3.1. Presentación informal . . . . . . . . . . . . . . . . . . . . 61.3.2. Definición formal de PCRs . . . . . . . . . . . . . . . . . 81.3.3. PCRs como funciones matemáticas . . . . . . . . . . . . . 8

1.4. Capítulo 5 - Extensiones de PCR . . . . . . . . . . . . . . . . . . 91.4.1. Paralelismo recursivo . . . . . . . . . . . . . . . . . . . . . 91.4.2. Redes de PCR . . . . . . . . . . . . . . . . . . . . . . . . 101.4.3. Feedback Loops . . . . . . . . . . . . . . . . . . . . . . . . 111.4.4. Información de ejecución de PCR . . . . . . . . . . . . . . 13

1.5. Capítulo 6 - Casos de estudio . . . . . . . . . . . . . . . . . . . . 141.6. Capítulo 7 - Un modelo concreto de ejecución para los PCRs . . 16

1.6.1. El modelo Concurrent Collections . . . . . . . . . . . . . 161.6.2. Traducción de las especificaciones de PCR a CnC . . . . . 171.6.3. Corrección semántica . . . . . . . . . . . . . . . . . . . . . 20

1.7. Capítulo 8 - Una librería C++ para desarrollar PCRs . . . . . . 211.7.1. Interfaz C++ de programación genérica de PCRs . . . . . . 211.7.2. Generación de código CnC ejecutable para PCRs . . . . . . 23

1.8. Capítulo 9 - Evaluación experimental . . . . . . . . . . . . . . . . 241.8.1. Metodología . . . . . . . . . . . . . . . . . . . . . . . . . . 251.8.2. Análisis de rendimiento básico . . . . . . . . . . . . . . . 251.8.3. Estudio de rendimiento de N-Queens . . . . . . . . . . . . 271.8.4. Estudio de rendimiento de Count-Words . . . . . . . . . . 281.8.5. Uso de PCR y CnC en la enseñanza . . . . . . . . . . . . 30

1.9. Capítulos 10 y 11 - Consideraciones finales . . . . . . . . . . . . . 311.9.1. Trabajos relacionados . . . . . . . . . . . . . . . . . . . . 311.9.2. Conclusiones y trabajos futuros . . . . . . . . . . . . . . . 33

Resumen de contribuciones . . . . . . . . . . . . . . . . . 33Desafíos futuros . . . . . . . . . . . . . . . . . . . . . . . 34

vii

viii CONTENTS

I Prelude 35

2. Introduction 372.1. The limits of single-core computing . . . . . . . . . . . . . . . . . 372.2. Parallel software development . . . . . . . . . . . . . . . . . . . . 382.3. Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392.4. Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3. Preliminaries 413.1. Pattern-Based Parallel Computing . . . . . . . . . . . . . . . . . 413.2. FXML Parallel Specifications . . . . . . . . . . . . . . . . . . . . 44

II Design of a Concurrent Pattern 47

4. The Produce–Consume–Reduce Pattern 494.1. Informal presentation . . . . . . . . . . . . . . . . . . . . . . . . 494.2. PCR Formal Definition . . . . . . . . . . . . . . . . . . . . . . . . 51

4.2.1. Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.2.2. Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Producer . . . . . . . . . . . . . . . . . . . . . . . . . . . 53Consumer . . . . . . . . . . . . . . . . . . . . . . . . . . . 53Iterator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54Reducer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54PCR nesting . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.2.3. PCRs as mathematical functions. . . . . . . . . . . . . . . 56

5. PCR Extensions 575.1. Recursive parallelism . . . . . . . . . . . . . . . . . . . . . . . . . 575.2. PCR networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595.3. Feedback Loop Computations . . . . . . . . . . . . . . . . . . . . 615.4. PCR Execution Hints . . . . . . . . . . . . . . . . . . . . . . . . 62

6. Case studies 656.1. Low Pass Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656.2. Backwards Reachability . . . . . . . . . . . . . . . . . . . . . . . 666.3. Count Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666.4. N-Queens Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 676.5. ID3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696.6. Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 70

III Implementation of PCRs 73

7. A Concrete Execution Model for PCRs 757.1. The Concurrent Collections Model . . . . . . . . . . . . . . . . . 757.2. Translating PCR specifications Into CnC . . . . . . . . . . . . . 76

7.2.1. Semantic Correctness . . . . . . . . . . . . . . . . . . . . 81

CONTENTS ix

8. A C++ Template Library for PCRs 838.1. Implementation Agnostic PCR C++ API . . . . . . . . . . . . . 838.2. Platform Dependant Target Code Generation . . . . . . . . . . . 86

9. Experimental evaluation 939.1. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 939.2. Basic performance analysis . . . . . . . . . . . . . . . . . . . . . 939.3. N-Queens performance study . . . . . . . . . . . . . . . . . . . . 959.4. Count-words performance study . . . . . . . . . . . . . . . . . . . 979.5. PCR and CnC usage in teaching . . . . . . . . . . . . . . . . . . 98

IV Discussion 103

10.Related work 105

11.Conclusions and Future Work 10911.1. Summary of contributions . . . . . . . . . . . . . . . . . . . . . . 10911.2. Future challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

Appendices 111

A. FXML Parallel Specifications 113A.1. Abstract syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113A.2. Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

A.2.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 116A.2.2. Semantic rules . . . . . . . . . . . . . . . . . . . . . . . . 117

B. Extended Proof Sketches 121B.1. Functional correctness of the CnC translation of PCRs. . . . . . . . 121

CONTENTS 1

Capítulo 1

Resumen en Español

1.1. Capítulo 2 - Introducción1.1.1. Limitaciones de la programación secuencial

El diseño del hardware ha pasado de tratar de mejorar la velocidad de unsólo procesador a aumentar la cantidad de núcleos de procesamiento disponiblesdentro de un mismo procesador. Múltiples problemas relacionados con la físicadel diseño de procesadores hicieron que la industria del hardware siguiera estadirección [5]. La Figura 1.1 ilustra el progreso realizado desde 1970 hasta el año2010 en tecnología de CPU.

Figura 1.1: Generaciones de CPU de Intel y su evolución en velocidad de reloj ynúmero de transistores (fuente: [31]).

Justo después del cambio de siglo se alcanzó un pico en la velocidad netadel procesador por núcleo, y la industria siguió la alternativa de agrupar múl-tiples CPUs en el mismo chip de hardware para seguir aumentando la potenciainformática total disponible.

Con respecto al software, el problema de aprovechar de manera eficiente lapotencia de procesamiento en paralelo es un desafío. Los sistemas operativos han

3

4 CAPÍTULO 1. RESUMEN EN ESPAÑOL

utilizado todos los núcleos de CPU disponibles para ejecutar múltiples tareasal mismo tiempo ya desde los comienzos de la computación multinúcleo. Sinembargo, esta técnica no aborda el escenario frecuente de una sola tarea queejecuta un cálculo costoso y cuyo rendimiento podría mejorarse si su trabajo sereparte entre múltiples núcleos de CPU y/o nodos de computación distribuidos.

Muchos factores añaden una complejidad adicional a la ya difícil cuestiónde desarrollar software concurrente correcto y eficiente de una manera rentable.Además de los conocidos errores de la programación paralela, como deadlocks,condiciones de carrera, etc. [40], el cambio de paradigma hacia múltiples núcleosdemanda hacer uso de diferentes patrones paralelos y modelos de ejecución, asícomo del código heredado existente que no siempre se puede volver a escribirdesde cero [11].

1.1.2. Desarrollo de software paraleloUn aspecto clave para manejar la complejidad de programación paralela es

el uso de descripciones abstractas de los programas. De acuerdo con [9], lasmetodologías de programación paralela se pueden categorizar por su nivel deabstracción de la siguiente manera. Las primitivas de gestión y sincronizaciónde subprocesos junto con los mecanismos de comunicación entre procesos (IPC)(por ejemplo, MPI [30]) se consideran abstracciones de nivel bajo. Las exten-siones a lenguajes (p. Ej., OpenMP [17] y Cilk [10]) y los frameworks (p. Ej.,TBB [46], TPL [38] y CnC [12] ) proporcionan un nivel de abstracción mediosimplificando parte pero no todo el trabajo de coordinación y sincronización.De manera similar, podríamos argumentar que los lenguajes de programaciónparalelos emergentes como X10 [47] y Chapel [13] pertenecen a esta catego-ría. 1 El diseño de software paralelo basado en patrones [41, 16, 27], tambiénconocido como paralelismo estructurado o esqueletos algorítmicos, proporcionaun nivel alto de abstracciones de programación paralelas. Consiste en utilizarconstrucciones comunes que ocultan al programador todos los mecanismos decoordinación y sincronización de bajo nivel que son necesarios para realizarla ejecución paralela concreta. Esto se hace mapeando, en compilación y/o entiempo de ejecución, las abstracciones de alto nivel en bibliotecas de nivel me-dio/bajo y/o construcciones del lenguaje base. Se ha demostrado que recurriral paralelismo estructurado permite aprovechar la potencia informática de lasarquitecturas heterogéneas [26].

1.1.3. ContribucionesLa situación antes mencionada crea una necesidad de técnicas y herramien-

tas que faciliten la construcción de software paralelo de una manera rentable.Esta tesis contribuye en esa dirección siguiendo un enfoque práctico basado enmodelos teóricos. Citando a A. Ranta, esto significa que “el objetivo final esescribir programas [paralelos] que funcionen, pero la mejor manera de lograrloes mediante el pensamiento teórico” [45].

Más precisamente, nuestro trabajo se basa en dos principios. En primer lugar,el software paralelo debe diseñarse de una manera independiente de la platafor-ma, de modo que la misma pieza de software pueda terminar ejecutándose en

1Una discusión en profundidad de los lenguajes de programación paralelos está fuera delalcance de esta tesis.

1.2. CAPÍTULO 3 - CONCEPTOS PRELIMINARES 5

un servidor de muchos núcleos, un grupo de nodos de bajo costo o una matriz(grid) de procesadores. En segundo lugar, el ajuste para un modelo de ejecuciónconcreto debe hacerse traduciendo formalmente dicho diseño independiente dela plataforma en una solución específica. Las características específicas de laplataforma y del entorno deben tenerse en cuenta en las fases relevantes de unproceso de generación de código formalmente fundamentado. Confiar en basesteóricas sólidas garantiza la corrección del código paralelo obtenido.

Siguiendo esta filosofía, este trabajo comienza definiendo un patrón de pro-gramación paralelo, llamado PCR [43], que describe los cálculos realizados simul-táneamente al comunicar Producers, Consumers, y Reducers, cada uno de ellospudiendo ser una función básica (lógica del negocio) o un PCR anidado. Combinaen un patrón único y composable varios conceptos de programación paraleloscomo colectivos [30], cálculos de eureka [34], iteración ilimitada y recursión, yprogramación basada en flujos [48]. La semántica de los PCRs se formaliza uti-lizando FXML [6, 54], que no se basa en ningún modelo de ejecución paralelaconcreta, lo que permite múltiples implementaciones de un programa. Tambiénmostramos que los PCRs se comportan como funciones en el sentido matemático,lo que asegura poder componerlos fácilmente. Con casos de estudio relevantes,ilustramos cómo los PCRs pueden facilitar la programación paralela en la práctica.Como segundo paso, proponemos una traducción formal sólida y completa dePCRs hacia un modelo ejecutable paralelo, en particular, Colecciones Concurren-tes (CnC) [12]. Para completar la contribución, desarrollamos una herramientade generación de código que abarca un motor de reescritura de plantillas paratraducir los PCRs a implementaciones basadas en CnC; manteniendo la abstracciónentre el concepto PCR y el modelo CnC con vistas a poder proveer implementacio-nes alternativas y/o heterogéneas en el futuro.

1.2. Capítulo 3 - Conceptos preliminares

1.2.1. Programación paralela basada en patronesEl campo de la programación paralela basada en patrones nació con el li-

bro de Cole [16]. Éste presenta y motiva la idea de funciones de alto ordenque abstraen el entorno de ejecución paralelo del programador; proporcionandoimplementaciones en una plataforma grid de cuatro patrones comunes de pro-gramación paralela. Al mismo tiempo que el enfoque de Cole, Kung[37] propusootros patrones de computación paralela en el contexto de matrices unidimensio-nales de procesadores.

Las siguientes dos décadas vieron muchas contribuciones al campo. Dar-lington et al. [18] contribuyeron con el Lenguaje de Coordinación Estructurada(SCL) combinando un conjunto de esqueletos algorítmicos predefinidos con se-mántica de programación funcional, lo que permite optimizar transformacionescomo fusión y distribución de operaciones de mapeo; su trabajo se centró enFortran + MPI como plataformas de ejecución. Casi al mismo tiempo, Baccipropuso el lenguaje de coordinación basado en el esqueleto P3L [7] dirigido allenguaje C y la plataforma MPI. La plataforma STAPL[56] ha estado en desarro-llo desde 1998 y se propone como una posible sustitución adaptativa paralela dela biblioteca de plantillas estándar de C++. Remitimos al lector a[27, 41] paraestudios en profundidad de la historia de esqueletos algorítmicos y de patrones


de ejecución paralela.

1.2.2. Especificaciones paralelas de FXMLComo introducción, proporcionamos de manera informal la sintaxis y semán-

tica FXML como herramienta formal base para las contribuciones de esta tesis. Ellector puede referirse al Appendix A y a [6, 54] para una definición profunda yformal.

Sintaxis. Una especificación FXML describe a un cómputo paralelo definiendoel comportamiento esperado de cualquiera de sus implementaciones válidas co-mo un conjunto de órdenes parciales. El cuerpo de una especificación FXML secompone de bloques llamados pnodes. Los tipos de pnode básicos son varia-bles (var), declaraciones de funciones (fun), asignaciones de valores a variablesy código básico. Los pnode básicos se ejecutan de manera atómica. Los pno-des se pueden combinar con construcciones de ejecución secuencial: seq, while,for, if-then-else, y con construcciones de ejecución en paralelo: par (bloques decódigo en paralelo) y forall (ciclo for paralelo). Los pnodes pueden ser etique-tados. Los pnodes dentro de los bucles (for, while, forall) se indexan de formaautomática y dinámica.

El paralelismo puede restringirse especificando dependencias de datos. Elenunciado dep Q(i)→P(i) especifica una dependencia de datos entre las apari-ciones de las asignaciones etiquetadas Q y P, lo que significa que la ocurrenciai-ésima de P debe usar el valor de la variable, digamos x, escrito por la ocu-rrencia de i-ésima de Q. Llamamos a esto una dependencia (i, i). FXML admitedependencias de la forma (i, g(i)), donde g es una transformación afín. Además,FXML proporciona algunos tipos predefinidos de dependencias: weak, es decir, elvalor de lectura será cualquier valor escrito, strong, es decir, cada valor escritodebe leerse al menos una vez, y bijective, es decir, cada valor escrito debe leerseexactamente una vez, en algún orden.

Las dependencias de datos y control determinan un orden parcial de ejecu-ción. La semántica de un pnode es un conjunto (posiblemente infinito) de órde-nes parciales (posiblemente infinitos), llamados ejecuciones, consistentes con laconjunción de restricciones impuestas por las dependencias.

1.3. Capítulo 4 - El patrón Producir–Consumir–Reducir

En este capítulo presentamos la primera contribución de esta Tesis: el patrónparalelo PCR, primero como un concepto informal y a continuación como defi-nición formal utilizando el lenguaje de especificación FXML y su semántica comosoporte.

1.3.1. Presentación informalEl patrón PCR busca expresar cálculos que consisten en productores consu-

miendo ítems de datos de entrada y generando, para cada uno de ellos, unconjunto de resultados que es consumido por varios consumidores que trabajan

1.3. CAPÍTULO 4 - EL PATRÓN PRODUCIR–CONSUMIR–REDUCIR 7

en paralelo. Sus resultados finalmente se agregan a un solo resultado median-te un reductor. El objetivo del patrón es enfatizar la independencia entre losdiferentes cálculos para exponer todas las oportunidades de paralelización.

Producer Reducer

Consumer 1Consumer 1Consumer 1

Consumer kConsumer kConsumer k


Figura 1.2: El patrón PCR.

Topología. La Figura 1.2 representa la forma general de un PCR. Las flechasrepresentan conexiones de datos en un PCR; las flechas completas modelan lasposibles múltiples fuentes de entrada y el único canal de salida al entorno ex-terno. Las flechas discontinuas denotan los canales de datos internos. Todos loscanales de entrada al PCR están disponibles para consumo de cualquier compo-nente interno. No se permiten ciclos de datos entre los componentes internos2: la red es en sí misma un gráfico acíclico dirigido (DAG) cuyo ordenamien-to topológico tiene al productor y al reductor como primer y último elementorespectivamente.

Flujo de datos. El flujo de información dentro de PCR es el siguiente. Paracada elemento de datos de entrada, el componente producer genera un conjun-to de valores de salida; cada uno está disponible de inmediato para leer. Loscomponentes consumer leen los valores del alcance externo y de los canales dedatos privados para realizar sus cálculos. Finalmente, un componente reducercombina valores de una o más fuentes de datos procedentes del producer y ceroo más consumers, generando un único elemento de salida para cada elemento deentrada externo procesado por el productor. Las lecturas en los canales de datosno son destructivas; el mismo valor puede ser leído por cualquier consumidor ypor el reductor. Ningún consumidor ignora ninguna entrada: todas las flechasdiscontinuas en el gráfico tienen la misma cantidad de elementos de datos paraleer.

Concurrencia. Los elementos producer, consumers y reducer trabajan en pa-ralelo sujetos a las dependencias de datos existentes: todos los elementos deentrada deben estar disponibles para una instancia de productor, consumidor oreductor para realizar su cálculo. Cada producer, consumer y reducer puede po-tencialmente generar tantas instancias de ejecución paralela como sea necesariopara cada carga de trabajo específica. Tanto la naturaleza de una instancia deejecución (proceso o subproceso local o remoto) como la política de planificaciónestán definidas por cada implementación PCR subyacente.

2La composición cíclica a través de la recursión se explora en el Sección 1.4


1.3.2. Definición formal de PCRsPara proporcionar tanto un lenguaje de especificación como una semánti-

ca formal al concepto PCR elegimos el lenguaje FXML y proponemos extensionessintácticas para facilitar la escritura de instancias de PCR.

〈PCR〉 ::= 〈PCR-name〉 ( 〈param-list〉 ) 〈body〉〈body〉 ::= par 〈producer〉

{forall p {par 〈cons-list〉1 } }〈reducer〉

〈producer〉 ::= p = produce 〈f-name〉〈var-list〉〈cons-list〉i ::= 〈consumer〉i 〈cons-list〉i+1 | ε〈consumer〉j ::= cj = consume 〈f-name〉〈var-list〉 |

cj = iterate 〈cnd〉〈f-name〉〈var-list〉〈reducer〉 ::= r = reduce〈cnd〉〈f-name〉〈init〉〈v-list〉

〈param-list〉 ::= 〈param〉 , 〈param-list〉 | 〈param〉〈v-list〉 ::= 〈var〉〈v-list〉 | 〈var〉〈f-name〉 ::= 〈PCR-name〉 | 〈basic-fun-name〉〈init〉 ::= 〈basic-fun-name〉〈param-list〉

Cuadro 1.1: Gramática de un PCR.

Sintaxis La sintaxis de PCRs se define en la Tabla 1.1. Llamamos var a unnombre de variable y param a un parámetro formal. Nos referimos como funcio-nes básicas a funciones provistas por el usuario implementadas en el lenguaje deprogramación base. Para simplificar la presentación, restringimos la gramáticapara incluir siempre un productor y un reductor. En la Sección 1.5 relajaremoseste requisito en casos especiales donde omitimos el par productor / reductor.

Semántica. Para definir la semántica formal de PCRs, comenzamos proporcio-nando la especificación FXML correspondiente de cada bloque de creación: produce,consume, iterate, y reduce, para el caso básico donde f es una función básica (Ta-bla 1.2).

1.3.3. PCRs como funciones matemáticasEn esta Tesis enunciamos la Propiedad 1 de los PCRs en FXML:

Propiedad 1. El resultado de evaluar un PCR en semántica FXML es una fun-ción de sus parámetros de entrada asumiendo que a) las funciones básicas delproducer, consumer y reducer functions son totales, y b) no se introducen ciclosde datos al hacer look-ahead en las funciones básicas.

1.4. CAPÍTULO 5 - EXTENSIONES DE PCR 9

PCR FXML

p = produce f x 1 forall(i=0;i<bndf (x);++i)2 p = f(x, p, i)

cj = consume fj x p c1..ck 1 cj = fj(x, p, c1..ck)

cj = iterate cnd fj z

1 seq

2 y = z3 repeat y = fj(y)4 until cnd(y)5 cj = ylast

r = reduce cnd⊕ v0 z1..zq

1 seq

2 v = v03 for p4 v = v[-1]⊕ 〈z1..zq〉5 if cnd(v) then break

6 r = vlast

Cuadro 1.2: Bloques básicos de un PCR y su especificación FXML .

1.4. Capítulo 5 - Extensiones de PCREn este capítulo presentamos varias extensiones del modelo básico PCR defi-

nido hasta ahora. Su uso se ilustra en el Sección 1.5.

1.4.1. Paralelismo recursivoLa Propiedad 1 muestra que cualquier PCR se comporta como una función

total. Esto permite llamar a PCRs desde cualquier función básica en un modobloqueante, es decir, donde el llamador espera hasta que la llamada regrese. Porsupuesto, incluso si el código llamador está bloqueado, se conserva el paralelismodentro del destinatario. Llamar a PCRs como funciones permite hacer paralelismorecursivo. Un ejemplo simple se muestra en la Figura 1.3. El mismo cuenta losnúmeros de Fibonacci que son a su vez primos eliminables, es decir, aquellos quetienen la propiedad de que eliminar los dígitos de uno en uno en algún orden dacomo resultado un nuevo número primo en cada paso.

Notar que la recursión presentada anteriormente no está implementada conanidamiento de PCRs sino a través del mecanismo de recursión nativa del lenguajede programación. Esto significa que conseguimos recursión ilimitada salvando elproblema del infinito anidamiento de PCRs.

Dividir y conquistar. El ejemplo más destacado de paralelismo recursivo esel método de dividir y conquistar, una técnica algorítmica clásica que consiste endividir una instancia compleja de un problema en varias instancias más pequeñasdel problema original, resolver cada una de forma independiente y combinar sussoluciones para calcular el resultado final. Cada subproblema puede resolverse


1 fun isPrimeAndDeletablePrime(x) = if isPrime(x)

2 the isDeletablePrime(x) else false

3 fun delete(x,i) = ... // elimina el i-esimo digito de x

1 PCR countFibDeletablePrimes (N):

2 par

3 p = produce fib N

4 forall p

5 c = consume

6 isPrimeAndDeletablePrime p

7 r = reduce sum 0 c

1 PCR isDeletablePrime (X):

2 par

3 p = produce delete X

4 forall p

5 c = consume isPrimeAndDeletablePrime p

6 r = reduce or false c

Figura 1.3: Ejemplo de llamada PCRs dentro de las funciones básicas

directamente si es lo suficientemente simple; de lo contrario, dividir y conquistarpuede aplicarse recursivamente.

Una forma común de describir la solución de dividir y conquistar de unproblema es definir alguna variante de las siguientes funciones:

is_base (x), un predicado comprobando si el problema x es un caso base;

base (x), que calcula la solución para el caso base x;

divide (x), que particiona el problema x en un conjunto de subproblemas;y

conquer (x, ca, cb), que describe cómo combinar las soluciones ca y cbusando el problema principal x como contexto; esto supone que las N so-luciones c1, . . . , cN de los subproblemas de divide (x) se pueden combinarcon conquer (x, cN (... , conquer (x, c1, null))) ...) 3.

La Figura 1.4 muestra una solución paralela basada en PCR. El productordivide el problema original en subproblemas utilizando la función iter_divide.Según el resultado de is_base, los consumidores procesan cada subproblema uti-lizando base o recurriendo a PCR divide _and _conquer. El reductor usa conquer paracombinar todas las soluciones de subproblemas. null es el subproblema vacío.La función terminate se usa para definir una condición de parada de eureka.

El uso del patrón dividir y conquistar se ilustra con el problema de N-Queensen la Sección 1.5.

1.4.2. Redes de PCRLos PCRs imponen una dependencia de (i, i) entre su entrada y su salida.

En otras palabras, para cada valor de entrada siempre hay exactamente unasalida. En algunos escenarios, es útil relajar esa dependencia para reenviar más(o menos) elementos de una fuente PCR a uno o más objetivos. Algunos ejemplosde conexiones entre dos PCRs A y B pueden ser:

3Denotamos null como el subproblema vacío y conquer se supone que está bien definido sialguno de sus parámetros es null, es decir, para escenarios con uno o cero subproblemas.


1 fun divide, is_base, base, conquer, terminate

2 fun subproblem(x) =

3 if is_base(x) then base(x) else divide_and_conquer(x)

4 fun iter_divide(x, i) = divide(x)[i]

5

6 PCR divide_and_conquer (x):

7 par

8 p = produce iter_divide x

9 forall p

10 c = consume subproblem p

11 r = reduce terminate conquer null x c

Figura 1.4: PCR para dividir y conquistar.

agrupar A salidas en tamaño variable buckets para ser procesadas por B;

particionar cada salida de A en un número variable de elementos leídospor B;

monitorear algunas salidas de A con B sin cambiar el comportamiento deA.

En todos estos casos de uso, la aplicación de la dependencia (i, i) es incompatibleo costosa en términos prácticos. Para resolver esto, proponemos un mecanismode conectar dos PCRs de la siguiente manera. La operación connect (in, out, d)es un puente del resultado de PCR A escrito a la variable in hacia el PCR B leyendode la variable out a través de una función delegate d(in, out): para cada valor desalida v escrito a in, se llamará al delegate d. La ejecución de d podría ignorarv en función de alguna condición o reenviar el valor fk(v), para alguna funciónfk, al PCRs B mediante la asignación out = fk(v)

La Figura 1.5 muestra un ejemplo de connect junto con su semántica deejecución.

La función delegate makeBucket se ejecuta para cada entrada y1 del PCR preProcess

y puede reenviar cero o más salidas al PCR doStatistics. El diagrama represen-ta una posible ejecución. Las etiquetas P , B y S representan instancias depreProcess, makeBucket y doStatistics, respectivamente.

Semántica de connect. Agregar un delegado intermedio y opaco permite rom-per la dependencia (i, i). La red de PCRs resultante ya no es un PCR. En efecto, alagregar connect se extiende el modelo a una jerarquía de dos niveles: un nivel dePCRs con dependencias internas (i, i), y un segundo nivel de conexiones entre PCRs

que tienen dependencias entre conjuntos indexados de tamaños potencialmentediferentes según se derivan del comportamiento de los delegados participantes.La forma de modelar esto en FXML es usar dependencias weak en la conexiónentre los PCRsorigen y destino.

1.4.3. Feedback LoopsAlgunos cálculos requieren producir, para cada elemento de entrada, uno o

más resultados, cada uno de los cuales puede ser una salida o retro-alimentarse


1 PCR preProcess(x): ...

2 PCR doStatistics(x): ...

3 delegate (in, out) makeBucket = lambda size:

4 if (mod (idx in) size == 0) then

5 out = bucket(x[0]..x[size-1])

6

7 par

8 forall x1:9 P1: y1 = preProcess (x1)10

11 forall x2:12 P2: y2 = doStatistics (x2)13

14 connect (y1, x2, makeBucket(3))

P 1 P 2P 0 P 5P 3 P 4

S1S0

D0 D1

Figura 1.5: (Izquierda) Conexión de preProcess y doStatistics por medio del delegatemakeBucket. (Derecha) Un posible orden parcial.

hacia el mismo componente. Los lenguajes de programación para aplicacionescon paralelismo de datos basadas en flujos de datos, como StreamIt [50], pro-porcionan construcciones explícitas para especificar ciclos de retroalimentación.En el contexto del paralelismo basado en tareas, este comportamiento corres-ponde al patrón workpile, donde una instancia de una tarea puede generar másinstancias y agregarlas a una pila de tareas a realizar [41]. Para modelar dichoscomportamientos, ampliamos los PCRs con:

o = feedbackloop f v

donde o es la variable de salida, f es una función y v es un valor. La Tabla 1.3esboza la semántica operacional big-step de feedbackloop. La notación A ` p ⇓ A′significa que la ejecución del programa p en un conjunto de asignaciones indexa-das A genera el conjunto de asignaciones indexadas A′. Usamos E para denotarel conjunto de asignaciones indexadas de las variables “ externas ”, mientrasque X denota las asignaciones indexadas de la variable x, que es “interna” alafeedbackloop. Usamos la notación 〈E ,X〉 para hacer explícita la separación en-tre variables externas e internas.

Fdb〈E , {v}〉 ` o = feedbackloop f v ⇓ 〈E ′, ∅〉

E ` o = feedbackloop f v ⇓ E ′

DoWorkxi ∈ X f i(xi) = Oi ∪Xi

〈E ,X〉 ` o = feedbackloop f v ⇓ 〈E ]Oi,X \ xi ]Xi〉

Cuadro 1.3: Semantics of feedbackloop.


Una diferencia importante con la construcción de FXML forall es que x seasigna dentro del cuerpo de forall , por lo tanto, el número total de instanciasde x no se conoce de antemano y la estructura general del cálculo es irregular.Además, feedbackloop es diferente de iterate en dos aspectos: 1) cada instanciapuede generar una salida, y 2) las instancias se pueden ejecutar tan pronto comose mantengan sus dependencias.

De hecho, la construcción feedbackloop implica una extensión adecuada delmodelo básico PCR ya que de alguna manera combina las capacidades del pro-ductor y del consumidor: consume cada instancia xi de x y produce un conjuntode salidas indexadas oj0 . . . ojmi y de nuevas instancias xk◦0 . . . xk◦ni de x paraser consumidas. Sin embargo, se puede componer con un reductor para obtenerun PCR de la siguiente manera:

1 PCR P (v):

2 par

3 o = feedbackloop f v

4 r = reduce cnd ⊕ r0 o

El uso de este patrón se ilustra con el caso de estudio N-Queens en el Sec-ción 1.5.

1.4.4. Información de ejecución de PCRPara poder generar implementaciones eficientes de instancias PCR es necesario

contar con información adicional sobre las propiedades de tiempo de ejecuciónde los componentes PCR especificados. Como primera aproximación, en esta tesisdefinimos varias sugerencias de rendimiento a nivel PCR que pueden ser propor-cionadas por el escritor de la especificación PCR para influir en la generación decódigo y en el comportamiento en tiempo de ejecución de la implementaciónsintetizada.

Sea P un PCR y sea c cualquiera de sus componentes productor, consumi-dor y reductor con los parámetros de entrada p1 . . . pk. Definimos la tupla deinformación de tiempo de ejecución Rc = 〈Dc, Lc, Gc〉 donde

Dc : [1 . . . k]→ Index→P(Index) es un mapa de información de depen-dencias, especificando, para cada parámetro de entrada pj , índice FXML Iy componente c, el conjunto Dc(j)(I) = {J ∈ Index | la ejecución cJ leeel valor pIj};

Lc : Index → N es una mapa de ubicaciones, dando, para cada ejecuciónindexada cI , un número natural Lc(I) que codifica una ubicación sugeridaque dependerá de la implementación; y

Gc ∈ N es un tamaño de grano, lo que sugiere que Gc ejecuciones de cI sedeben agrupar para mejorar el rendimiento global.

La información en Rc puede ser proporcionada total o parcialmente por elautor de la especificación para permitir que el entorno de ejecución apliquemejoras durante la ejecución que dependerán de las capacidades de el o losmodelos de concurrencia concreta elegidos para implementar la instancia PCR.


1.5. Capítulo 6 - Casos de estudioEn este capítulo, ilustramos cómo PCRs, junto con las extensiones discutidas

hasta ahora, permiten expresar patrones de programación paralelos comúnmenteutilizados [41] y facilitar la comparación de diferentes enfoques para paralelizarproblemas específicos.

Low Pass Filter Este ejemplo es una interconexión lineal de tres etapas decómputo: filtro de paso bajo (LPF), demodulator (DEMOD) y equalizer (EQU ) [50].Un aspecto interesante es que las etapas de cómputo consumen una ventanadeslizante de elementos consecutivos para generar su salida: LPF y EQU consumenNUM_TAPS elementos de entrada consecutivos para producir un solo elemento desalida, mientras que DEMOD usa dos valores consecutivos para generar una salida.Una forma común de modelar esto es usar almacenamientos intermedios parainterconectar las etapas de la tubería. Con PCRs, se modela naturalmente me-diante operaciones look-ahead leyendo una ventana en la historia de elementosde las secuencias de entrada.

Backwards Reachability La verificación teórica de las propiedades de segu-ridad de autómatas [15] se puede ver como el problema de verificar si se puedealcanzar un conjunto S de estados de error a partir de un conjunto I de estadosiniciales. La Figura 1.6 describe el algoritmo de alcanzabilidad hacia atrás pararesolver este problema. Utiliza iterate para iterar un PCR Bckwrds interno querealiza un paso de la alcanzabilidad hacia atrás. Para cada estado s en StSet,calcula sus predecesores llamando a PRED, que a su vez podría ser un PCR. C esla condición de detención. Puede definirse para terminar siempre que se calcu-la un punto fijo o se alcanza un estado inicial, en cuyo caso se puede usar unreductor eureka en PCR Bckwrds.

1 // funciones de estado

2 estados divertidos, sunion

3

4 // punto de entrada de PCR

5 PCR BackReach (S):

6 StSet = iterar Bckwrds S C

1 // un paso de alcance hacia atras

2 PCR Bckwrds (StSet):

3 par

4 s = produce states StSet

5 forall s

6 p = consume PRED s StSet

7 r = reduce sunion StSet p

Figura 1.6: PCR “ Alcanzabilidad hacia atrás ”

Count-Words Dado un texto T y un conjunto W de palabras, count-wordscalcula el número de apariciones de cada w ∈W en T . Este es un ejemplo típicode MapReduce [19].

Problema de las N Reinas. Este problema consiste en colocar N reinas enun tablero de ajedrez de N × N , sin que dos reinas compartan la misma fila,columna o diagonal. Ilustra la técnica de backtracking, ya que las soluciones can-didatas se pueden construir colocando una reina en cada fila del tablero hasta

1.5. CAPÍTULO 6 - CASOS DE ESTUDIO 15

que a) N reinas se colocaron con éxito (se encontró una solución completa), o b)no se pueden agregar más reinas sin provocar conflictos, o c) todas las configu-raciones del tablero han sido probadas. En cualquiera de estos casos, se realizaun paso hacia atrás para probar otras configuraciones posibles modificando elposicionamiento de la última reina colocada y probando el siguiente casilleroposible para la última fila. Esto se puede paralelizar visitando las ramas deretroceso en paralelo y combinando resultados parciales.

Weather

Parents

Visiting

Parents

Visiting

Parents

Visiting

Money

TennisCinema

Shopping

Stay in

sunny

windy

rainy

richpoor

noyes

yes no

noyes

Cinema

Cinema

Cinema

Un árbol de decisiones

Algoritmo ID3 El aprendizaje por árbolesde decisión [4] es un método de aprendizajeautomático que consiste en crear y usar unárbol de decisión para clasificar entradas deun conjunto de datos. Cada entrada consisteen una valuación para cada uno de los atri-butos de interés en el modelo. Los nodos in-ternos representan puntos de decisión de loscuales sale una rama por cada decisión posi-ble. Los nodos hoja están etiquetados con unaclase que representa el resultado del procesode decisión.

ID3 [44] es un método de aprendizaje automático para construir árboles dedecisión basado en un conjunto de datos de entrenamiento con ejemplos declasificaciones de valuaciones de entrada. En cada iteración, se elige un atributoa ∈ A en el que a particiona el conjunto de ejemplos con entropía mínima.Luego, para cada valor vi de a en S, el algoritmo genera recursivamente unsubárbol usando a como raíz, {e ∈ examplese[a] = vi} como subconjunto deejemplos y A − {a} como conjunto de atributos candidatos para particionar.Identificamos dos clases de oportunidades de paralelización independientes eneste algoritmo. En primer lugar, el patrón de división y conquista en paralelo(Sección 5.1) se puede usar para ejecutar al mismo tiempo las tareas de creacióndel subárbol independiente. En segundo lugar, las operaciones en paralelo bulkde datos, como los cálculos de frecuencia de valores, la entropía y la particiónestablecida, pueden expresarse naturalmente como PCRs.

InputFromPubSub

Deserialize

Translate

TagTeam

ScoreSentiment

ExtractKeywords

KeyByTeam

ExtractSentiment

SlidingWindow

Mean

OutputMeanToBigQuery

TweetTransformer

CalculateSentiment

CorrelateKeywords

Figura 1.7: Tubería deSentiment analysis

Sentiment Analysis Sentiment analysis [42] es lapráctica de combinar técnicas de análisis de texto yde lenguaje natural para medir cualidades subjetivas;por ejemplo, si una revisión de usuario escrita en unsitio web es positiva o negativa, qué emoción trans-mite un comentario de un foro (enojo, alegría) ...) ypropiedades similares. Por lo general, requiere proce-sar grandes cantidades de datos de entrada, ya que loscomentarios de millones de personas se utilizan parainferir una opinión general sobre un asunto específi-co, por ejemplo, en redes sociales, sitios de reserva dehoteles, etc.

Aquí estudiamos un ejemplo inspirado en [28] en elque se presenta una cartera de operaciones en tweetscomo parte de la presentación del producto Google


Cloud Dataflow. La Figura 1.7 (right) representa un escenario en el que tweetscon respecto a un partido de fútbol se leen desde un editor de la nube / puntofinal del suscriptor (InputFromPubSub). El componente TweetTransformer aplica va-rias transformaciones en ellos, seguidas por dos análisis diferentes en los tweetstransformados, de los cuales nos centramos en el componente CalculateSentiment

- extrayendo los sentimientos de los tweets y agrupándolos en cubos de N- inter-valos de minutos Notar que las operaciones TweetTransformer se pueden aplicaren paralelo a cada tweet entrante.

1.6. Capítulo 7 - Un modelo concreto de ejecu-ción para los PCRs

En este capítulo presentamos las Concurrent Collections [12]; un modeloobjetivo concreto para implementar PCRs.

1.6.1. El modelo Concurrent Collections

t0

s0 s1

i1

i2

Environment

posts tags

into t0 and

data into i0

Environment

consumes

output from i2t1i0

Figura 1.8: Diagrama de un grafo CnC. Los diamantes, óvalos y cajas representanrespectivamente colecciones de tags, steps e ítems. Las líneas punteadas indicanrelaciones prescribe (de control) entre las colecciones de tags y steps; las flechasdenotan las relaciones produce / consume / control entre las colecciones de steps,

items y tags.

Elementos básicos. Un programa CnC consiste en una descripción de altonivel de un gráfico de computación, código heredado para ser ejecutado y elentorno. Los átomos básicos son steps de computación que alojan código he-redado a ser ejecutado; instancias de control o tags, donde cada valor de tagrepresenta la orden de una unidad de trabajo que se realizará con cada stepdependiente; y items de datos, que son leídos y escritos por los steps durantesu trabajo de cómputo.

Relaciones. El grafo CnC se construye interconectando colecciones de los ele-mentos introducidos. Cada colección de tags prescribe una cantidad de stepsde cálculo. Los valores de los tags pueden ser put (insertados) en las coleccionesde tags por el ambiente o por otros pasos; en el segundo caso, decimos que elstep controla dicha colección de tags. La publicación de un nuevo valor de taggenerará una instancia de cada colección de step controlados con el valor de tagcomo parámetro. Cada colección de items es un mapa concurrente que almacenaelementos indexados por valores de tags; proporcionando operaciones get y put.Los items son get/put por el entorno o por otros steps; en el segundo caso, lossteps consumen desde / producne hacia la colección de items. Para garantizar

1.6. CAPÍTULO 7 - UNMODELO CONCRETODE EJECUCIÓN PARA LOS PCRS17

el determinismo, la semántica CnC prohíbe la sobreescritura de tags o items encualquier colección (Dynamic single assignment). La Figura 1.8 muestra unainstancia de CnC e ilustra los bloques de construcción básicos.

Semántica. CnC se describe mediante una semántica operacional considerandoun lenguaje simplificado llamado Featherweight-CnC [12], que formaliza el con-cepto de iniciar pasos cuando se publican tags. La memoria (el contenido delas colecciones de elementos) se expresa como una matriz de memoria planallamada data; se asume que los contenidos de cada colección de items se asignaunívocamente a ubicaciones de la memoria data. Las colecciones de tags se re-presentan mediante reglas de reescritura de generación de steps en la semánticaoperacional.

Describimos brevemente los elementos de la sintaxis Featherweight-CnC rele-vantes. Los steps de cómputo CnC se escriben como funciones con un solo pa-rámetro de entrada de valor de tag. La ejecución de un step S se desencadenapor la operación prescribe S. La memoria está representada por el objeto globaldata con get (i) para leer el valor almacenado en la ubicación de memoria i yput (i, v) para escribir el valor v en la ubicación de memoria i. El requisito deAsignación única dinámica de la semántica CnC significa que puede haber comomáximo una ejecución data.put (i, v) para cada valor posible de tag i.

1.6.2. Traducción de las especificaciones de PCR a CnCHabiendo elegido CnC como un modelo de cómputo interesante para imple-

mentar PCRs, definimos una traducción de PCR hacia Featherweight-CnCpara rela-cionar las dos semánticas. Después, mostramos que la semántica del código CnC

es funcionalmente equivalente a la semántica FXML del PCR.

Modelo de memoria CnC. Como los steps de cómputo CnC no tienen estado,la única memoria utilizada por un programa CnC es el almacenamiento de suscolecciones de elementos. En su definición de semántica operacional, CnC simpli-fica el modelo de memoria formal mapeando todas las colecciones de elementosen una única estructura plana data. En esta traducción, nosotros suponemosque este mapeo de aplanamiento de la memoria es dado, pero que se hace unseguimiento de la memoria asignada a cada colección de elementos. Para lograresto, denotamos, para la colección de elementos asociada con la variable FXML x,a datos x como la memoria reservada a ella; por lo tanto, data x.get(t) denota elvalor almacenado en data para el valor de tag t en la colección de items asociadaa x. Del mismo modo, data x.put(t, v) denota una operación de escritura en lacolección de items asociada a x del valor v para el valor de tag t.

Variables e índices FXML. En CnC, las colecciones de items representan elhistorial de asignaciones sobre variables FXML . El índice de cada asignación es elvalor del tag utilizado como clave en la colección. Como los índices FXML puedenser multidimensionales debido a la anidación de espacios de iteración, el tipo deltag es un vector de valores enteros. Al leer una variable FXML indexada con unadimensión menor inferior, la implementación trunca la dimensión del tag dellector a la dimensión de la variable leída. De esta forma, la operación de lectura


realizada en la colección de items de origen está bien definida para valores detag de cualquier dimensión.

Traducción de look-ahead / look-behind. Para traducir estas operacionessobre variables FXML en CnC necesitamos reescribir las funciones básicas f usadasen las primitivas produce, consume y reduce . Sea una función f(s1, .., sk) = E convariables de entrada FXML si escrita como una expresión E, y sea un índice I,definimos el operador de traducción f I como

f(s1, .., sk)I≡ E[si[d] := datasi

.get(I + d)]i∈1..k

donde E[x := y] denota la sustitución sintáctica de x por y en la expresión E.Sin pérdida de generalidad, suponemos que cada variable si aparece en E en laforma si[d] para un entero d, y que d es un valor y no una expresión.

Producer. Dada una función de productor f(x1..xn, p, i) y una expresión en-lazada bound f (x1..xn), la CnC translation ofp = produce f x1..xn Se define como

1 producef (I) {

2 for (i = 0; i < boundf (x1..xn)I; ++ i)

3 {

4 data p.put(I ◦ i, f(x1..xn, p, i)Ioi

)

5 PCR_prescribe consume 1...k (I ◦ i)6 prescribir reducir (I ◦ i)7 }

8 prescribir reduce-end (I ◦ i)}

produce

inputs

p

consume 1..k

reduce

reduce-end

Notar que la asignación del productor incrementa la dimensión del índice.Esta traducción del productor prescribe un step reductor especial reduce-end

con un valor de tag que codifica la cantidad total de items para procesar. Paraprescribir a los consumidores, definimos el macro PCR_prescribe C que se traduceen prescribe C si C es una función básica, y en prescribe produce Csi C es unPCR anidado, donde produce C es el producto correspondiente paso para la CnC

traducción de PCR C.

Consumer. Dada una función básica f(x1..xn, p, c1..ck), la CnC traducción decj = consume f x1..xn, p, c1..ck es

1 basic_consumef (I) {data cj . put (I, f(x1..xn, p, c1..ck)I)}

consumeinputs c

Si el consumidor es un PCR P , la traducción no genera un step CnC sino queexpande recursivamente la definición en las traducciones CnC correspondientes deproductor, consumidor y reductor de P . El paso del productor anidado realiza lasoperaciones get para los parámetros de entrada, y el paso del reductor realizala operación put en su colección de elementos de salida que corresponde a lavariable cj .

1.6. CAPÍTULO 7 - UNMODELO CONCRETODE EJECUCIÓN PARA LOS PCRS19

Reducer. La implementación CnC del reducer PCR r = reduce cnd f v0 z1..zq,con f(v, z1..zq), se define como

1 reducef (I ◦ i) {

2 if (i == 0) u = v03 else u = datav. get (I ◦ (i− 1))4 u = f(u, z1..zq)

I◦i

5 si cnd (u) data r. put (I, u)

6 else datav. put (I ◦ i, u)

7 }

8 reduce-endf (I ◦ i) {

9 datar. put (I, datav. get (I ◦ (i− 1)))}

reduce

inputs

v

rreduce-

end

El reductorcompacta el espacio de iteración anidado (con los tags I◦0 dotsI◦(k − 1)) en el del alcance externo (con el tag I). Cada step reduce ejecuta unaoperación leyendo el resultado del step anterior almacenado en la colección v, ousando el valor inicial v0 (para el primer step reductor). u es una variable localdel step en el lenguaje de programación. Después de comprobar cnd, el step deejecución publica el resultado en el resultado final r o v. reduce-end es finalmenteprescripto por el productor y reenvía el último valor de v al resultado r. Seejecuta exactamente una de las operaciones put de las líneas 5 y 9. Si se omite laoperación cnd, se supone la función constante false (es decir, ninguna entradaproducirá un evento eureka).

Iterate. La implementación de cj = iterar cnd f z es una traducción directadel pseudocódigo en la Tabla 4.2.

1 iteratef (I) {

2 i = 0

3 datay.put(I ◦ i, f(z)I)

4 repeat

5 datay.put(I ◦ (i+ 1), f(y)Ioi

);

6 i = i + 1;

7 until cnd(y)Ioi

8 datacj. put (I, datay.get(I ◦ i))}

input

iterate

y

c

Feedback Loop. Dada una función f , la traducción CnC de o = feedbackloop f ves


1 fdb-startf (I) {

2 i = 0

3 data x.put(I ◦ i, data v. get (I))4 prescribir fdbf (I ◦ i)5 }

6

7 fdbf (I ◦ i) {

8 O,X = f(x)Ioi

9 para u ∈ X10 j = nextindex (datos x, I)11 data x.put(I ◦ j, u)

12 prescribir fdbf (I ◦ j)13 para u ∈ O14 j = nextindex (datos o, I)15 data o.put(I ◦ j, u)

16 PCR_prescribe consume 1...k (I ◦ j)17 }

fdb-start

input

o

consume 1..k

fdb

x

fdb

El código CnC en los pasos fdbf y fdb-startf usa una función atómica nextindex

para obtener un índice nuevo para usar en las operaciones put. Realizar llamadask a nextindex devolverá índices consecutivos de 0 a k− 1 para cada combinaciónde parámetros de entrada. De nuevo, u es una variable local del paso en el idiomadel host.

1.6.3. Corrección semántica

La implementación de PCRs usando grafos CnC conserva la semántica funcionalde la especificación original PCR calculando el mismo valor de salida.

Semánticas FXML y Featherweight-CnC. FXML proporciona semántica a lasespecificaciones de forma denotacional, que describe el conjunto de ejecucionesválidas para cualquier implementación de la especificación. Por el contrario, CnCda semántica operacional de paso pequeño a sus programas. Para relacionarambos, comparamos su comportamiento funcional; a fin de relacionar valoresasignados a variables indexadas FXML con ubicaciones en colecciones de items CnC

de la implementación propuesta.

Proposición 2. Sea P una definición PCR de aridad n, y sean v1..vn variablesFXML con asignaciones de índice I cada una, y sea v = yI = P (vI1 ..vIn) el resultadoFXML de la evaluación de P con vI1 ..vIn como parámetros, y sea data v1..vn

una CnC

región de memoria que almacena el historial completo de asignaciones a v1..vn,y sea P̂ la traducción de P a CnC. Entonces, la ejecución CnC de P̂ (I) almacenaen data I

y el valor v dado por [[P ]].

1.7. CAPÍTULO 8 - UNA LIBRERÍA C++ PARA DESARROLLAR PCRS21

1.7. Capítulo 8 - Una librería C++ para desa-rrollar PCRs

Como parte de este trabajo, se desarrolló un conjunto de plantillas C++ paraproporcionar una descripción agnóstica de alto nivel de PCRs. La biblioteca deplantillas maneja anidamiento de PCRs e integración con código básico. En estecapítulo presentamos una breve discusión de la biblioteca y las elecciones dediseño realizadas.

1.7.1. Interfaz C++ de programación genérica de PCRs

En C++, los PCRs se especifican como composiciones de plantillas de defini-ciones de tipo. Un PCR se especifica como una instancia de la plantilla pcr pa-rametrizada con la lista de los tipos de elementos: entradas Ti, productor P,consumidores Ci y reductor R:

pcr <T_1, ..., T_n, Productor, Consumer_1, ..., Consumer_k, Reducer>

Elemento Especificación C++ Notas

T_i pcr_in<type(Xi) > tipos de parámetros de entrada

Producer pcr_produce<PROD,parameters...> PROD: declaración de productor bási-co

pcr_feedback<FDB,parameter> FDB: declaración de feedbackloop bá-sico

Consumer pcr_consume<CONS,parameters...> CONS: consumidor básico o PCR

pcr_iterate<PCR,parameter> Iteración de PCRanidadoReducer pcr_reduce<RED, parameters...> RED: declaración de reductor básico

PROD producer<decltype(fp),fp , tuner> productor básicoFDB feedback<decltype(fp),fp , tuner> feedbackloop básico

CONSconsumer<decltype(fp),fp , tuner> consumidor básico

pcr<...> consumidro PCR anidadoRED reducer<decltype(fp),fp , tuner> reductor básico

Cuadro 1.4: Resumen de plantillas C++ usadas para especificar PCRs.

La Tabla 1.4 (arriba) resume la sintaxis para especificar T i, P, C y R.En todos los casos, parameters ... es una lista i1, ..., ik de constantes enteraspositivas que especifican los parámetros fuente como posiciones relativas haciaatrás en la lista del cuerpo del PCR ; es decir, ik significa “el resultado del ik-ésimoprevio elemento del cuerpo del PCR ”. El caso especial de feedbackloop toma unsolo parámetro de entrada.

La Tabla 1.4 (abajo) muestra la sintaxis para declarar elementos de tipoPROD, FDB, CONS y RED. Se requiere un puntero de función fp a una función deusuario en cada caso. El parámetro tuner es un tipo que implementa la tupla deinformación de tiempo de ejecución opcional descripta en la Subsección 1.4.4.Su tipo predeterminado es proporcionado por el framework. Los contenidos deltipo tuner se detallan más adelante en esta sección (ver Tabla 1.6).


Parámetros de función de usuario. Las funciones básicas en el ejemplode countFibPrimes tienen sus parámetros de entrada decorados con la interfaz deplantilla pcr_var 4. Este tipo de decoración abstrae las variables FXML en el códigodel lenguaje de base y permite las operaciones de flujo look-ahead y look-behinddescritas en la Sección 3.2. La Tabla 1.5 resume las operaciones implementadaspor esta interfaz.

Descripción de la operación

operator T () leer valor de índice implícitoint idx () devuelve índice implícito actual

operador T [] (int) look-ahead / look-behind relativo al índice implícito

Cuadro 1.5: Parámetro de función pcr_var operadores de envoltura para funcionesbásicas.

Evaluación de PCRs como funciones. Para habilitar la interacción con elentorno, la plantilla pcr implementa la función llamada a función operator ()

que pasa las entradas dadas al patrón PCR de manera no bloqueante, ya que elcaso de uso esperado es alimentar al PCR con una secuencia de valores para pro-cesar antes de empezar a consumir los resultados. Éstos se devuelven de formasíncrona y proporcionan una interfaz C++ de tipo iterator junto con los métodosbegin y end siguiendo las convenciones estándar de la biblioteca de iteradores dellenguaje C++ estándar.

Conección de PCRs. La sección 1.4.2 describe una forma general de conectarmúltiples PCRs en una red de árbol. Esta tesis proporciona, como prueba deconcepto, dos formas limitadas de conección de PCRs: un connect entre dos PCRs

por medio de un delegado y branch bifurcando el resultado de un PCR a loscanales de entrada de otros dos PCRs.

Información C++ para optimización de PCRs. Para permitir el ajuste derendimiento de la implementación como se describe en la Subsección 1.4.4, lainterfaz de programación PCR C++ proporciona parámetros de plantilla opcionalesque modifican el comportamiento en tiempo de ejecución. La Tabla 1.6 describela interfaz C++ para proporcionar información de ajustes de rendimiento de unamanera independiente de la implementación.

La tupla Rc de la Subsección 1.4.4 se relaciona con los métodos de Tabla 1.6de la siguiente manera.

La información provista por la asignación de información de dependenciaD tiene que ser provista por los métodos expected_reads, ingredients_indexedy consumer_indexes.

La asignación de ubicación L corresponde a la salida del método locations.

El tamaño del grano G corresponde a GROUP_SIZE.4 La decoración de los parámetros de las funciones básicas es equivalente al operador de

traducción EI presentado en la Subsección 1.6.2.

1.7. CAPÍTULO 8 - UNA LIBRERÍA C++ PARA DESARROLLAR PCRS23

Dato PCR Descripción

expected_reads (N) especifica el número de lecturas de cada índice del parámetroN

consumed_indexes (N) especifica el conjunto de índices consumidos del parámetroN-ésimo para cada índice de consumo

consumer_indexes (N) especifica el conjunto de índices de consumidores del paráme-tro N-ésimo para cada índice de productores

locations (nlocs) index to location mappingsize_t GROUP_SIZE tamaño de grupo para usar en una reducción o consumidor;

el número de trozos paralelos de reducción para N elementosestará ceil (N / GROUP_SIZE)

Cuadro 1.6: Ajustes de rendimiento opcionales de PCR

El conjunto mínimo de parámetros descritos en la Tabla 1.6 puede ser uti-lizado por la traducción y el tiempo de ejecución. Todos los parámetros sonfunciones de cada índice de variable FXML para permitir un ajuste diferente paracada elemento a procesar.

1.7.2. Generación de código CnC ejecutable para PCRs

Una vez definidas todas las plantillas de la interfaz PCR, aplicamos técnicas demetaprogramación de plantillas para desplegar las definiciones PCR en un grafoCnC. En el marco CnC C++ un gráfico de computación está representado por unainstancia de la clase CnC :: context. Cada coleccion de steps, items y tags estarávinculada a la instancia de contexto CnC que la contiene.

Como los PCRs se pueden componer mediante el anidamiento; y el contexto CnC

es plano, la implementación genera una subclase CnC::context que representa elPCR completo en una representación plana. Las reglas de expansión de la plantillaconvierten una definición de pcr <X1, .., Xn, P, C1, ..., Ck, R> en una cadenade herencia pública con esta forma:

cnc (R): cnc (C k): ...: cnc (C 1): cnc (P): cnc (X n):...: cnc (X 1):CnC :: context.Aquí, cnc representa la clase CnC sintetizada para el componente PCR corres-

pondiente.

CnC implementation of pcr_var. Cada variable FXML se mapea en una colecciónde items que mapea cada asignación FXML al valor correspondiente. El tipo deplantilla tag_t representa un vector entero estático cuya dimensión dependerá delnivel de anidamiento PCR de una instancia de pcr_var; la operación last devuelveel valor de índice para el nivel de anidación más interno. El tipo de plantillavar_t abstrae el tipo del valor almacenado por el pcr_var.

Implementación CnC de la evaluación de PCRs. Para permitir la comuni-cación entre el PCR y su entorno, la traducción sintetiza una implementación CnC

que esencialmente realiza las operaciones descritas en la Subsección 1.6.2 con ladiferencia de que se envía una entrada al PCR está implementado por operator(),y el consumo de los resultados está encapsulado en la implementación CnC de lainterfaz iterator descripta en la Subsección 1.7.1.


Ajustes de rendimiento CnC. El entorno de programación C++ CnC admite laespecificación de tuners, que son sugerencias de optimización aplicadas a colec-ciones de steps, items y tags. Los tag tuners permiten la partición de rangos deetiquetas; que permite el procesamiento de un grupo de valores de tag por la mis-ma instancia de step, mejorando la localidad de datos. Los tag tuners tambiénpermiten la memoización de valores de tags. Los item tuners permiten al usua-rio especificar el número de operaciones de lectura esperadas en cada elementoalmacenado (get_count). Estos también permiten especificar en qué proceso (pa-ra escenarios distribuidos) se producirá / consumirá cada elemento almacenado.Los step tuners permiten al programador a) declarar las dependencias de datosespecificando qué valores de tag consumirá un step a partir de qué coleccionesde items en una ejecución de step específica y b) especificar la prioridad relativadel step y dar pistas de afinidad de hilos / procesos al planificador del sistema.Vale la pena mencionar el cancel_tuner que permite al programador señalar laterminación anticipada de todas las instancias en ejecución de un step.

Teniendo en cuenta las sugerencias de rendimiento descritas en la Tabla 1.6,analizamos su aplicación a la implementación CnC

expected_reads se usa en el step tuner del consumidor para declarar de-pendencias de datos y en el item tuner de cada entrada consumida paradeclarar la propiedad get_count para la optimización del uso de la memoria.

consumed_indexes y consumer_indexes se usan junto con location para imple-mentar las sugerencias de ajustes distribuidos consumed_on y produced_on deCnC.

GROUP_SIZE es utilizado por la implementación CnC para manejar el agrupa-miento de items para procesar en consumidores y reductores.

La Figura 1.9 muestra un esquema de la interacción de un contexto CnC consu-midor con un contexto CnC productor y el uso del ajuste con fines de optimizaciónpor parte de este último.

producer context

cnc tuner

producer context PCR tuner

producer context

steps

consumer context

producer context

output item collection

post expected reads

and locations

collect tuning information

set items get_count, produced_on

and consumed_on propertiesset execution location

self-declare

input dependencies

Figura 1.9: Uso de la información de ajuste de PCR en la implementación de CnC

1.8. Capítulo 9 - Evaluación experimentalEn este capítulo evaluamos la aplicación práctica de los PCRs en términos

de rendimiento. Nuestros objetivos son (a) validar la aplicabilidad de la técnica

1.8. CAPÍTULO 9 - EVALUACIÓN EXPERIMENTAL 25

de generación de código CnC desarrollada para PCRs, (b) validar la aplicabilidadde los ajustes de alto rendimiento PCR para la implementación CnC y (c) ilustrarPCRs como una herramienta para comparar diferentes implementaciones paralelaspara resolver el mismo problema.

Para cumplir con estos fines, primero comparamos el rendimiento de CnC

implementaciones codificadas a mano contra PCR especificaciones en términosde tiempo y espacio. En segundo lugar, evaluamos el efecto de PCR tuning delrendimiento en varios benchmarks multinúcleo y distribuidos. Finalmente, com-paramos el rendimiento en tiempo de ejecución de diferentes implementacionesparalelas para el mismo problema, utilizando PCRs como herramienta de desa-rrollo. También informamos de los resultados de aplicar PCRs en la enseñanza deprogramación paralela.

1.8.1. MetodologíaLas pruebas comparativas se realizaron en un servidor con 4 procesadores

(AMD Opteron (TM) Processor 6276 @ 2.30GHz) de 16 núcleos cada uno (64núcleos totales) con 128 GB de memoria ejecutando la versión de 64 bits deRedHat Linux Enterprise 6.7. Cada ejecución se repitió 10 veces y el tiempo deejecución total transcurrido y el uso máximo de la memoria residente (ResidentSet Size o RSS) se registraron según lo informado por el sistema operativo. Paracada conjunto de 10 mediciones de tiempo y memoria, se calculó la mediana.Para evaluar la variabilidad, también se calculó el rango de valores. La cantidadde núcleos a utilizar en cada ejecución se controló configurando la CPU affinitymask con la utilidad Linux taskset. Se usó la misma máscara de CPU para cadaejecución de k núcleos. Para cada programa medido, el tamaño del problemafue fijo (strong scaling), variando sólo la cantidad de núcleos utilizados.

1.8.2. Análisis de rendimiento básicoEn esta sección hacemos un análisis preliminar de la implementación CnC

actual de nuestra biblioteca de códigos PCR parallel. Dado que en esta etapa laCnC implementación de PCRs no está muy optimizada, el objetivo es proporcio-nar una visión general inicial del comportamiento del tiempo de ejecución paraevaluar la viabilidad del enfoque. Para el análisis de benchmarking seguimos lasrecomendaciones dadas en [33].

Para elaborar el análisis de desempeño realizado, se consideraron las siguien-tes dos preguntas principales:

Q1 ¿Cómo se compara el código PCR generado con las soluciones CnC codificadasa mano en términos de tiempo de ejecución y consumo de memoria?.

Q2 ¿Cómo afecta la generación automática de código CnC a los recursos de com-pilación y los tamaños de los ejecutables?.

Pregunta 1. Para comparar, elegimos dos programas CnC básicos: blackscho-les y producer-consumer. Ambas son tuberías con una y dos etapas de consumorespectivamente. En estos casos, se esperan tiempos de ejecución y uso de me-moria virtualmente equivalentes, ya que la estructura CnC sintetizada por lasreglas de generación de PCR code debe coincidir estrechamente con la implemen-tación codificada a mano existente. La Figura 1.10 resume los resultados. El


código de prefijo de 2 letras en los subtítulos denota cada punto de referenciade la siguiente manera: BS para blackscholes y PC para producer-consumer.Cada diagrama de caja compara PCR contra CNC en uso de tiempo / memoriacuando se utilizan 16, 32 y 64 núcleos.

tim

e (

s)

0.5

1

1.5

2

2.5

3

0.5

1

1.5

2

2.5

3

PCR16 CNC16 PCR32 CNC32 PCR64 CNC64

(a) BS time measuresRSS(KB)

40,000

60,000

80,000

100,000

120,000

140,000

160,000

40,000

60,000

80,000

100,000

120,000

140,000

160,000


(b) BS RSS measurement

tim

e (

s)

4

6

8

10

12

14

4

6

8

10

12

14


(c) PC time measures

RS

S (

KB

)

0

20,000

40,000

60,000

80,000

100,000

120,000

0

20,000

40,000

60,000

80,000

100,000

120,000


(d) PC RSS mediciones

Figura 1.10: PCR vs CNC comparaciones de tiempo y memoria para diferentesnúmeros de núcleos informáticos disponibles.

Para blackscholes, los tiempos de ejecución son muy similares con una disper-sión muy baja para ejecuciones de 32 y 64 núcleos en ambas implementaciones.El consumo de memoria también es muy similar en ambas implementacionesque muestran que el consumo de memoria tiene cierta dispersión en ambas im-plementaciones independientemente del número de núcleos.

Se observan las mismas similitudes de tiempo y memoria para producer-consumer. La dispersión del tiempo sigue siendo notable para las ejecucionesde 32 núcleos y casi insignificante en las ejecuciones de 64 núcleos. Esto esconsistente con el hecho de que una tubería más larga necesitará más núcleos decomputación para que los elementos fluyan a través de ella a un ritmo constante.

Pregunta 2. Para responder a nuestra segunda pregunta, comparamos ta-maños de ejecutables finales. El mismo compilador, opciones del compiladory bibliotecas vinculadas se usaron para generar todos los ejecutables. Para res-ponder a la segunda pregunta, compilamos diferentes PCR pipelines de longitudescrecientes y tiempo de compilación medido, uso de la memoria de compilacióny tamaño del ejecutable final. Las mediciones muestran un aumento lineal en


tiempo y espacio, lo que sugiere que PCR el costo de síntesis e impacto en el tama-ño ejecutable binario escala linealmente con PCR size, validando la escalabilidaddel enfoque en términos de costos de recursos de generación de código.

Observaciones finales. En general, los benchmarks muestran que la im-plementación preliminar PCR tiene un rendimiento cualitativamente similar encomparación con las implementaciones CnC directas. Además, incluso si el ta-maño del archivo ejecutable aumenta al escribir una PCR, los tamaños estándentro de los límites razonables para código paralelo de propósito general. Jun-tos, estos resultados respaldan la aplicabilidad de la herramienta de generaciónde código CnC.

1.8.3. Estudio de rendimiento de N-QueensEn esta sección, comparamos las tres implementaciones de N-Queens presen-

tadas en la Sección 1.5. Usamos las versiones eureka, terminando el cálculo tanpronto como se encuentra el primer resultado. Establecemos el límite de tiempoen un minuto.

Primero, analizamos los tiempos de ejecuciónpara varias profundidades derecursión y tamaños de problema (Figura 1.11).

N=16

tim

e (

s)

0

10

20

30

40

recursion depth

0 1 2 3 4 5 6

feedbackloop iterate DC

N=27

tim

e (

s)

0

20

40

60

80

recursion depth

0 2 4 6 8 10 12 14

feedbackloop DC

Figura 1.11: Comparación de tiempo de ejecución entre implementaciones deN-Queens PCR con profundidad de recursión variable para diferentes N .

La implementación iterate con N = 16 muestra su mejor rendimiento conla profundidad de recursión 3 y alcanza el límite de tiempo para profundida-des de recursión superiores a 4. Para N = 27, esta implementación no finalizadentro del límite de tiempo para cualquier profundidad de recursión. La imple-mentación de dividir y conquistar muestra tiempos de ejecución mucho mejoresqueiterate. Observamos que al aumentar la profundidad de recursión se empeo-ran los tiempos de ejecución y, finalmente, se alcanza el límite de tiempo conprofundidades de recursión superiores a 3. Esta implementación no aumenta alaumentar el tamaño de los problemas, como se muestra en el gráfico N = 27.La implementación feedbackloop funciona mal para profundidades de recursiónpequeñas, alcanzando el límite de tiempo. Observamos que el aumento de laprofundidad de recursión mejora los tiempos de ejecución y, finalmente, logramejores tiempos que las otras implementaciones. La Figura 1.12 resume el mejortiempo de ejecución logrado en cada implementación para diferentes tamañosde problema, mostrando que solo feedbackloop escalas más allá de N = 29.


tim

e (

s)

0

20

40

60

NQ board size (N)

15 20 25 30


Figura 1.12: Resumen del mejor tiempo de ejecución en cada implementación paraN = 16.32

time(s)

0

1

2

3

0

1

2

3

FDB16 CNC16 FDB32 CNC32 FDB64 CNC64

Figura 1.13: Comparación de tiempo entre la implementación de feedbackloop

N-Queens y una implementación de CnC codificada a mano.

Como segunda medida de rendimiento, la Figura 1.13 muestra los tiemposde ejecución de la mejor implementación PCR que identificamos hasta ahora (enbase a feedbackloop) en comparación con el códigoCnCcodificado a mano que esparte de CnC suite. Esta implementación sigue el patrón work-pile. Ambas im-plementaciones usan la misma profundidad de recursión. Todas las ejecucionesse repitieron 20 veces y se realizaron con 16, 32 y 64 núcleos. Con 32 núcleos,la variabilidad se reduce, pero una pequeña ventaja es aún evidente en la im-plementación CnC. Finalmente, con 64 núcleos se observa poca variabilidad; conambas implementaciones que tienen tiempos de ejecución casi equivalentes.

1.8.4. Estudio de rendimiento de Count-WordsEn esta sección, realizamos una comparación de rendimiento de las dos so-

luciones PCR propuestas para el problema count-words en la Sección 6.3. Lasimplementaciones reales dividen el archivo en trozos de varias líneas. Las eje-cuciones se repitieron 10 veces y se registró la mediana de las mediciones. Seutilizó un archivo de texto de entrada de 111M líneas con un tamaño total de8GB.


tim

e (

s)

25

30

35

cores

16 24 32 48 64

by lines by words CnC by lines tuned

Figura 1.14: Comparación de las implementaciones de recuento de palabras alaumentar el número de núcleos disponibles.

La Figura 1.14 muestra tiempos de ejecución de implementaciones basadasen PCR para contar 9 palabras con un tamaño de fragmento de 10K líneas.También compara PCR versions con una CnC implementation pura de count-wordstomada de ejemplos CnC que usa un grafo de reducción paralela. Globalmente,count-words-by-lines muestra un mejor rendimiento que count-words-by-words, pe-ro la brecha tiende a disminuir para un mayor número de núcleos. Para másde 32 núcleos, las versiones basadas en PCR muestran tiempos de ejecuciónligeramente mejores que CnC. También analizamos el efecto de proporcionar in-formación de ejecución a nivel PCR para count-words-by-lines. De su especificaciónen la Figura 6.3, se deduce que las variables l y c se escriben tantas veces comose producen líneas, pero son leídas exactamente una vez por el consumidor ypor el reductor respectivamente. Una acción de ajuste simple es especificar estehecho, mediante dependencias, lo que permite que el entorno CnC libere memoriatan pronto como se haya leído cada valor. La Figura 1.14 muestra que la versiónajustada ofrece el mejor rendimiento.

Claramente, la cantidad de palabras para contar y el tamaño del fragmentousado para particionar el archivo de entrada afectan el rendimiento. La Figu-ra 1.15 muestra los tiempos de ejecución de implementaciones sin ajustes de ren-dimiento basadas en PCR con 64 núcleos, para contar 100 palabras con tamañosde segmentos cada vez mayores. El tiempo de ejecución de count-words-by-lines

aumenta constantemente con el tamaño del fragmento. Esto es coherente con elhecho de que busca cada palabra en cada fragmento de forma secuencial, porlo que los trozos más grandes reducen el paralelismo explotable. Por otro lado,count-words-by-words, que cuenta cada palabra en paralelo, aprovecha una mayorcantidad de palabras para contar, mientras que si hay demasiados fragmentospequeños se afecta negativamente su rendimiento. La figura muestra que su tiem-po de ejecución baja hasta un tamaño de fragmento de 100 K, sin más mejoras.Para tamaños de fragmentos de 10M, count-words-by-lines muestra una dismi-nución abrupta del rendimiento en comparación con count-words-by-words. Estoes consistente con el hecho de que para este tamaño hay menos fragmentos quela cantidad de procesadores disponibles. En este escenario count-words-by-words

se beneficia de la dimensión extra del paralelismo.


tim

e(s

)

50

100

150

chunk size

10k 100k 1M 5M 10M

by linesby words

Figura 1.15: Comparación de dos implementaciones de recuento de palabras condiferentes tamaños de fragmentos.

1.8.5. Uso de PCR y CnC en la enseñanzaEn esta sección presentamos una experiencia de uso de PCRs como herra-

mienta para enseñar programación paralela basada en patrones a estudiantesde grado en Ciencias de la Computación. Comparamos el resultado de usar PCRs

puros contra usar CnC para proporcionar una solución paralela a un problema.

Contexto. Durante el segundo semestre de 2017, se impartió un curso elec-tivo de programación paralela 5 para estudiantes de pregrado. El lenguaje deespecificación FXML se enseñó como una forma formal de especificar el paralelismopotencial. Además, los PCRs se presentaron junto con el modelo de programaciónparalela CnC. Los estudiantes del curso no tenían experiencia previa con PCRs

o CnC, y solo tenían conocimientos básicos de programación concurrente (pri-mitivas de hilos y sincronización) de un curso previo obligatorio de SistemasOperativos.

Actividad. El proyecto final del curso consistió en elegir un entorno de pro-gramación para diseñar una solución paralela para un problema elegido por losestudiantes, y describir su semántica paralela utilizando especificaciones FXML .Dos estudiantes eligieron CnC y PCR para paralelizar el problema de contar nú-meros primos en un intervalo dado. Implementaron el algoritmo de verificaciónde primos Miller-Rabin en su variante determinista, utilizando el algoritmo dela Criba de Eratóstenes como acelerador: teniendo un conjunto de todos losnúmeros primos de 2 hasta algún K constante, se puede verificar la primalidadde todo número en el intervalo K −K2 buscando simplemente algún divisor enel conjunto de primos conocidos.

Los estudiantes eligieron realizar verificaciones de primalidad en paralelopara cada número en el intervalo a verificar, mientras mantienen secuencial elalgoritmo de Miller-Rabin en sí mismo.

Comentarios de los estudiantes. Después de la experiencia y la revisióndel rendimiento de las implementaciones, discutimos con los estudiantes sobresu experiencia de desarrollo. Comparando ambos entornos, encontraron a los

5Lugar: Departamento de Computación, FCEyN, UBA.

1.9. CAPÍTULOS 10 Y 11 - CONSIDERACIONES FINALES 31

PCRs más fácil de entender y escribir, incluso afirmando que ‘’Es como escribirfunciones simples ”. Consideraron que el desarrollo del grafo CnC era más confusoy propenso a errores.

Conclusiones de la experiencia. Esta experiencia preliminar muestra resul-tados prometedores y sugiere que los PCRs se pueden usar para enseñar programa-ción paralela. Creemos que sería valioso realizar un estudio con más estudiantespara evaluar las capacidades educativas de los PCRs.

1.9. Capítulos 10 y 11 - Consideraciones finales1.9.1. Trabajos relacionados

Los PCRs están relacionados con los esqueletos algorítmicos [27] y los mo-delos de procesamiento de flujo [48]. Siguiendo las conclusiones resumidas en[49], analizamos las capacidades de los esquemas actuales de flujo y esqueletosalgorítmicos con respecto a: a) exposición de paralelismo potencial de tareas ydatos; b) exponer la presencia de paralelismo de ventana deslizante; c) prevenirel uso de filtros con estado; d) describir naturalmente topologías de computacióncomplejas; y e) mantener la plataforma de descripción de problemas paralelosagnóstica a la plataforma concreta de ejecución.

En lo sucesivo, nos referimos a los componentes atómicos de un entornode programación como Single / Multiple Input (SI/MI) o Single/Multiple Out-put (SO/MO) con respecto al número de canales de entrada / salida que cadacomponente puede tener.

Los PCRs cubren los requisitos mencionados dado que a) su semántica deejecución que permite instancias simultáneas del mismo consumidor y encade-namiento de consumidores, junto con dependencias FXML que describen el para-lelismo de datos; b) proporcionan operaciones de lookahead/lookbehind en lasvariables PCR para permitir la expresión de ventanas deslizantes y cómputos sten-cil; c) usan funciones puras en el código básico; d) admiten PCRs anidados conparámetros nombrados para las entradas de productor, consumidor y reductor(MISO), permitiendo conexiones complejas de PCRs manteniendo un compor-tamiento funcional; y e) separan las especificaciones PCR de la plataforma deejecución.

Entornos de esqueletos algorítmicos. Limitamos la discusión a aquellosque están más relacionados con PCRs.

Quaff [24] declara una topología de coordinación de cómputo concurrentepor la composición de esqueletos básicos. Utiliza funciones de código heredadorestringidas a un solo parámetro de entrada (SISO). Otra limitación es que laconstrucción farm requiere fijar en tiempo de compilación la cantidad de proce-sadores que se utilizarán en el tiempo de ejecución. En [23], se proporciona unasemántica de CSP [32] de Quaff sin mostrar la corrección de la implementacióncon respecto a esta semántica.

Muesli [14] admite paralelismo de tareas mediante la construcción de unatopología de procesos conectados y paralelismo de datos mediante el uso deestructuras de datos distribuidos. La topología se construye componiendo ins-tancias de objeto (polimorfismo dinámico) de esqueletos básicos (Pipe, Filter,


Farm) y algorítmicos (DivideAndConquer, BranchAndbound). Toda la comuni-cación en estructuras paralelas de datos es explícita, por lo que la descripción delproblema tiene lógica de comunicación mezclada. No se proporciona un modeloformal.

MapReduce [19] define un modelo de computación simple junto con una im-plementación de referencia que maneja la distribución del trabajo. Una extensión[21] de la implementación MapReduce permite cálculos iterativos, similares a laconstrucción PCRs iterate. Los PCRs extienden MapReduce con los conceptos decomposición por anidación, iteración, componentes del productor y look-ahead/ look-behind "nativo".

El Orleans Skeleton Library [36] sigue el modelo de computación en paralelosincrónico masivo (BSP) [51] con átomos básicos de SISO. Su semántica se for-maliza mediante la reescritura de términos, pero no se proporciona una relaciónformal con el modelo de ejecución real (MPI).

SkePU [22] es una biblioteca de esqueletos orientada a GPU / multinúcleo.Admite cálculos MISO pero limita el número de parámetros de entrada a 3. Nose proporciona ninguna semántica formal de la ejecución del esqueleto.

STAPL [57] se centra en la reutilización composicional de esqueletos y pro-porciona una sintaxis específica del dominio para describir las interconexiones decomponentes complejos. Es compatible con iteración, pero carece de semánticaformal.

Resumiendo, a nuestro leal saber y entender, la mayoría de los esquemas al-gorítmicos están restringidos a componentes SISO, carecen de semántica formaly de soporte de iteración y de cálculos eureka, y no proporcionan un soporteespecífico para las operaciones look-ahead y look-behind.

Modelos de programación de flujos de datos. StreamIT [50] es un len-guaje de programación basado en tuberías de filtros. Cada filtro tiene una en-trada y una secuencia de salida. La comunicación se logra mediante operacionespush, pop y peek (mirar hacia adelante) en los canales. Las topologías com-plejas se pueden ensamblar utilizando FeedbackLoop (iteración) y conectoresSplitJoin que primero separan y luego combinan elementos de/a una secuenciaa/de muchos. Las principales diferencias con PCRs son: a) las operaciones popson destructivas en los canales, lo que impide que los filtros utilicen el histo-rial completo de valores (look-behind); b) falta de soporte directo de filtros conmúltiples flujos de entrada independientes; y c) SplitJoin está restringido a laspolíticas duplicate y round-robin, mientras que la construcciónPCR connect admitecualquier política definida por el usuario.

FastFlow [2] es un marco de programación de flujos de dato multicapa conuna capa inferior de componentes SISO conectados por canales anónimos utiliza-dos para proporcionar en la capa intermedia componentes MISO y MIMO; unacapa superior de esqueletos completa el entorno. No proporciona capacidadesde lookahead/lookbehind.

RISC-pb2l [1] es un conjunto de bloques de construcción paralelos imple-mentados sobre FastFlow que permite la construcción de complejos patronesde computación en paralelo. RISC no modela conexiones de datos complejasdirectamente porque los canales son anónimos, y hereda la falta de soporte deFastFlow para mirar hacia adelante/atrás en sus canales de entrada.

S-Net [29] es un modelo de flujo de datos que consta de cajas SISO sin estado

1.9. CAPÍTULOS 10 Y 11 - CONSIDERACIONES FINALES 33

interconectadas por flujos de registros que forman un grafo de computación ací-clica. El subtipado de se proporciona para habilitar la composición y adaptaciónde cuadros a diferentes entornos. El modelo S-Net no proporciona capacidadesde lookahead/lookbehind en sus flujos de entrada, y no puede modelar fácilmenteconexiones de datos complejas porque los canales son anónimos.

En resumen, ninguno de los marcos citados es totalmente compatible con losconceptos de lookahead/lookbehind. Además, se basan en conexiones anónimasque dificultan la especificación de interacciones complejas entre componentes.

Otros modelos. Multi-BSP [52] es una extensión de BSP [51]. Este modelode árbol multinivel tiene como objetivo describir las capacidades de concurren-cia de una arquitectura hecha de una combinación de elementos de hardwarey software. Cada nivel define los parámetros de tiempo de ejecución, como elnúmero de procesadores, los costos de sincronización / comunicación y los ta-maños de caché. Su objetivo es escribir algoritmos concurrentes conscientes delos parámetros arquitectónicos con el fin de lograr una implementación óptimapara cualquier arquitectura y para cualquier valor concreto de sus parámetros.Una diferencia importante con PCRs es que Multi-BSP impone varios requisitos alas arquitecturas para soportarlo, mientras que PCRs son completamente ajenosa la arquitectura.

Dryad [35] modela cómputos de grano grueso como grafos acíclicos dirigidoscon fragmentos secuenciales de código en los nodos. El lenguaje de descripciónes plano aunque se proporciona un operador de composición de gráfico. El en-foque principal está en el rendimiento de la implementación y la adaptación delprograma a los recursos disponibles.

Claramente, ambos modelos son objetivos interesantes para la generaciónautomática de código a partir de especificaciones PCR.

1.9.2. Conclusiones y trabajos futurosResumen de contribuciones

Modelos formales. Desde un punto de vista teórico, definimos un patrónparalelo con soporte de composición que combina conceptos como operacionescolectivas, eurekas, iteración, recursión y programación de flujos de datos. For-malizamos su semántica abstracta y propusimos una concreta a través de unatraducción formal de PCRs intoCnC. Ilustramos con varios casos de estudio de di-ferentes dominios de aplicaciones que los PCRs pueden facilitar la escritura deprogramas paralelos.

Herramientas de programación en paralelo. En lo que respecta a laherramienta, desarrollamos un entorno de programación que consiste en (i) unabiblioteca para escribir PCRs independientes de la plataforma, y (ii) un motor degeneración de código basado en metaprogramación de plantillas para traducirlasen implementaciones basadas en CnC. Vale la pena mencionar que los templatesC++ de PCR están diseñados para mejorar la portabilidad en el sentido de quediferentes plataformas de ejecución de destino podrían ser utilizadas para lageneración de código.

El entorno proporciona una generación automatizada de código paralelo ba-sada en teoría a partir de descripciones estructuradas de alto nivel independien-


tes de la plataforma. Los resultados experimentales proporcionaron evidenciade que el código sintetizado puede lograr desempeños que son comparables alos de los programas de bajo nivel, no estructurados y dependientes de la pla-taforma. Esto es una señal de que el enfoque podría ser factible en la práctica.Desde una perspectiva de ingeniería de software, resultaría en un menor esfuer-zo de codificación, más confiabilidad y un prototipado más rápido de variasimplementaciones.

Desafíos futuros

Hay varias oportunidades de desarrollo para los PCRs.

Extensiones formales del modelo. Planeamos generalizar el uso de anida-miento PCR en los productores y reductores. Con respecto a connect, planeamosextender la semántica FXML para permitir la descripción de conexiones más ricasentre PCRs en particular, unir de manera no determinista las secuencias de salidade dos o más PCRs en la entrada corriente de un tercero.

Otra extensión futura es la capacidad de especificar cálculos en los cualeslas relaciones (i, i) dentro de PCR pueden relajarse a relaciones biyectivas parapermitir la reindización de elementos de la salida en función del orden de gene-ración en lugar del orden de indexación impuesta por el productor. Esto permiteque el reductor comience a procesar los artículos tan pronto como se producen,lo que es particularmente útil para implementar cálculos eureka más rápidos.

Extensiones del entorno de alto nivel C++ . Una extensión prevista delentorno consiste en habilitar la composición asincrónica de PCRs, permitiendoque el llamante y el destinatario puedan proceder en paralelo al no bloquear elanterior hasta que necesite el resultado de este último o retorno (por ejemplo,como los Futures [53] , Cilk [10]). Otra extensión futura de PCRs C++ es proporcio-nar una forma general de escribir redes de PCRs por medio de una construccióngenérica connect con capacidades avanzadas de composición.

Extensiones de implementación CnC. La implementación CnC actual noaprovecha totalmente el uso de tag tuners CnC que pueden ayudar a implementarparticiones más eficientes de rangos de tags para acelerar cálculos de grano grue-so. Tenemos que hacer una evaluación exhaustiva de las capacidades distribuidasde la implementación PCR basada en CnC.

Implementaciones alternativas. Otras direcciones futuras de investigaciónincluyen experimentar PCRs en otras plataformas, como clusters distribuidos,arquitecturas de muchos núcleos y GPU, así como implementar el concepto PCR

en otros lenguajes de programación.

Estudios con usuarios reales. Ampliando la experiencia reportada en lasección 1.8.5, planeamos realizar experimentos que midan la facilidad de usode PCRs comparados con otros marcos de programación paralelos, tanto con es-tudiantes como con programadores experimentados. Uno de los objetivos esevaluar la viabilidad de PCRs como herramientas para la enseñanza simultáneade programación.

Part I

Prelude

35

Chapter 2

Introduction

2.1. The limits of single-core computingHardware design have shifted from trying to improve the speed of a single

processor to increasing the number of available processing cores. Multiple issuesrelated to the physics of processor design made the hardware industry to followthis direction [5]. Figure 2.1 depicts the progress made since 1970 until the endof the last decade in CPU technology.

Figure 2.1: Intel CPU generations and their evolution in clock speed and number oftransistors (source: [31]).

Right after the turn of the century a peak in per-core processor raw speed wasreached, and the industry followed the alternative of grouping multiple CPUs

37

38 CHAPTER 2. INTRODUCTION

in the same hardware chip in order to keep increasing total available computingpower.

In the software side, the problem of efficiently taking advantage of parallel-processing power is challenging. Operating systems have used all available CPUcores to run multiple tasks concurrently since the beginning of multiple corecomputing. However, this mechanism does not address the frequent scenarioof a single task running a costly computation and which performance couldbe improved if its work is split across multiple CPU cores and/or distributedcomputation nodes.

Many factors add extra complexity to the already difficult issue of developingcorrect and efficient concurrent software in a cost-effective way. Besides the well-known pitfalls of parallel programming, such as deadlocks, data races, etc. [40],the paradigm shift towards multiple core hardware demands for making use ofdifferent parallel patterns and execution models, as well as existing legacy codewhich cannot always be rewritten from scratch [11].

2.2. Parallel software developmentA key aspect to handle parallel programming complexity is the use of abstract

program descriptions. According to [9], parallel programming methodologiescan be categorized by their level of abstraction as follows. Thread manage-ment and synchronization primitives together with inter-processes communica-tion (IPC) mechanisms (e.g., MPI [30]) are considered to be low level abstrac-tions. Language extensions (e.g., OpenMP [17] and Cilk [10]) and frameworks(e.g., TBB [46], TPL [38] and CnC [12]) provide a middle level of abstraction byrelieving part but not all of the coordination and synchronization efforts fromthe programmer. Similarly, we could argue that emerging parallel programminglanguages such as X10 [47] and Chapel [13] belong to this category.1 Pattern-based parallel-software design [41, 16, 27], also known as structured parallelismor algorithmic skeletons, provides high level parallel programming abstractions.They consist of common constructs that hide from the programmer all low-levelcoordination and synchronization mechanisms which are necessary to performthe actual parallel execution. This is done by mapping, at compile- and/orat run-time, the high-level abstractions into middle/low-level libraries and/orlanguage constructs. It has been shown that resorting to structured parallelismallows harnessing the computing power of heterogeneous architectures [26].

Numerous works advocate using pattern-based parallel programming [27].Among the most recent and successful ones we should cite [24, 19, 36, 3, 22, 1,55]. Despite their contribution to the field, it is worth observing that they havesome drawbacks. First, they provide little or no formal foundations. Notableexceptions are [36, 3, 23], which give abstract semantics to the patterns, butdo not establish any formal relationship with the concrete underlying executionmodel. This decoupling inhibits proving whether runs of the actual programindeed correspond to behaviors defined by the high-level abstraction. Second,they provide no easy means of combining different patterns (or instances of thesame pattern) in a compositional way, a problem which has been identified andpartially addressed recently in [55].

1An in-depth discussion of parallel programming languages is out of the scope of this thesis.

2.3. CONTRIBUTIONS 39

The platform-independent parallel software design approach has been fa-vored in the design of FXML [6, 54], a formal specification language for express-ing parallelism, together with control and data dependencies. In FXML, parallelcomposition does not entail physical concurrent execution at runtime, but onlylogical concurrency. Control and data dependencies can be annotated with prop-erties to restrict parallelism because of timing, precedence or data constraints.FXML does not rely on any concrete execution model of concurrency, so it allowsfor multiple implementations of a program. To achieve this, FXML provides abasic set of primitives which can be extended with constructs for expressingmore concrete mechanisms and execution models of concurrency (e.g., threads,shared memory and condition variables, message passing, etc.). An FXML compi-lation chain is a sequence of program transformations which are only allowed torestrict the degree of parallelism. FXML automatic multi-dimensional indexing ofassignments allows specifying complex data and control dependencies betweendifferent code segments. This key feature makes it well suited for capturingvarious classes of parallel computations. Clearly, code-generation based on for-mal transformations of FXML specifications ensure implementation correctness.Nevertheless, describing complex parallel computations in basic FXML may beoverwhelming as it involves correctly identifying and specifying all data andcontrol dependencies. Indeed, this is counterproductive as the ultimate goal isto enhance software quality and productivity by easing parallel programming.

2.3. ContributionsThe aforementioned situation creates a need of techniques and tools that

ease building parallel software in a cost-effective way. This thesis looks forwardto contributing in that direction by following a theory-based practical approach.Quoting A. Ranta, this means that “the ultimate goal is to write [parallel] pro-grams that work, but the best way to achieve it is via theoretical thinking” [45].

More precisely, our work relies on two principles. First, parallel softwareshould be designed in a platform-independent way, so as the same piece of soft-ware could end up running either in a many-core server, a cluster of inexpen-sive nodes, or a processor grid. Second, tuning for a concrete execution modelshould be done by formally translating such a platform-independent design intoa specific solution. Platform-specific and environment characteristics are to befactored in at relevant phases of a formally-grounded code generation process.Relying on sound theoretical bases guarantees correctness.

To do so, this work starts by defining a parallel programming pattern, calledPCR [43], which describes computations performed concurrently by communicat-ing Producers, Consumers, and Reducers, each one being either a basic function(business logic), or a nested PCR. It combines in a single and composable pat-tern several parallel programming concepts like collectives [30], eureka compu-tations [34], unbounded iteration and recursion, and stream programming [48].The semantics of PCRs is formalized using FXML [6, 54], which does not rely on anyconcrete execution model of concurrency, enabling multiple implementations ofa program. PCRs are shown to behave as functions which ensures seamless com-position. With relevant case studies, we illustrate how PCRs can ease parallelprogramming in practice. To enable writing actual programs, we designed andimplemented a platform-agnostic C++ template library supporting PCRs. As a

40 CHAPTER 2. INTRODUCTION

second step, we propose a sound and complete formal translation of PCRs into anexecutable parallel model, namely Concurrent Collections (CnC) [12]. To com-plete the contribution, we developed a code-generation tool which encompassesa template rewriting engine for translating PCRs into CnC-based implementations.

2.4. StructureThe document is structured in four parts. The remaining of this first part,

comprising the current chapter 2 and the following chapter 3, introduces theparallel programming problem as a motivator, the concept of pattern-basedparallel programming, and the formal FXML specification language.

The second part, comprising chapters 4 and 5, is devoted to introducingand describing syntax, semantics and applications of the PCR pattern. chapter 4presents the PCR pattern. It starts with a high-level description together with amotivating example. Then, the semantics of PCRs is formalized with FXML. chap-ter 5 discusses several extensions to the basic PCR pattern which allow composingPCRs in a number of different ways beyond the basic PCR computational model.chapter 6 explores several case studies of increasing complexity to illustrate howPCRs express commonly used parallel-programming patterns [41]. In particular,it discusses PCR specifications of two mid-size, data-analytics case studies, namelyQuinlan’s ID3 algorithm [44] and sentiment analysis [39].

The third part of the thesis, spanning chapters 7 to 9, explains the implemen-tation of PCRs over a concrete execution model, together with its associated tooland an empirical evaluation of it. Chapter 7 briefly introduces CnC, proposingan implementation of PCRs in terms of CnC and proving its correctness. Chapter8 describes a C++ template library which provides a programming framework forPCRs. It also sketches a concrete CnC-based implementation. Chapter 9 evaluatesthe current PCR implementation based on CnC and applies the implementation asa tool to explore the achievable parallelism of several case studies.

The last part of this thesis, composed of chapters 10 and 11, discusses relatedand future work.

Chapter 3

Preliminaries

3.1. Pattern-Based Parallel ComputingHistory. The field of pattern-based parallel programming born with Cole’sbook [16]. It presents and motivates the idea of higher order functions ab-stracting the parallel execution environment from the programmer by providinggrid implementations of four common parallel programming patterns. Concur-rently with Cole’s approach, Kung [37] proposed some more parallel computingpatterns in the context of a single-dimensional processor array. The next twodecades saw many contributions to the field. Darlington et al. [18] contributedthe Structured Coordination Language (SCL) combining a set of predefinedalgorithmic skeletons with functional programming semantics, allowing for op-timizing transformations like map fusion and distribution; their work focused onFortran+MPI as execution platforms. Around the same time, Bacci proposedthe P3L [7] skeleton-based coordination language targeting the C language andthe MPI platform. The STAPL [56] platform has been in development since1998 and is proposed as a possible adaptive parallel replacement of C++’s stan-dard template library. We refer the reader to [27, 41] for an in-depth survey ofthe history of algorithmic skeleton frameworks.

Common skeletons. Algorithmic skeletons are usually classified as data par-allel, task parallel and resolution patterns [27]. In what follows we describe thewell-known algorithmic skeletons as identified by the literature. Data parallelpatterns describe computation patterns centered on distributing data items tobe processed in parallel among several processing units, usually in an indepen-dent way. Task parallel patterns describe how to distribute different computa-tion tasks to be executed among several available processing units. Resolutionpatterns describe specific problem-solving computation schemas like Divide &Conquer, Eureka computations and Branch & Bound search.

41

42 CHAPTER 3. PRELIMINARIES

input

worker farmFarm. The farm pattern describes com-putations where the input is partitionedand distributed among a set of workers com-puting the same operation in a muti-core ordistributed architecture.

Parallel Pipeline

input stage 1 stage 2 stage N output

Pipeline. The parallel pipeline patternextends the classical pipeline pattern withstages supporting parallel execution ofconcurrent inputs. This pattern is applied both to low-level CPU instructionexecution and to high-level task pipelines.

Task 1

Task 2

Task 3

fork

main task

do work

join

nish tasks

Fork/Join

spawn

tasks

Fork/Join. This pattern models computa-tions where a task spawns several others and,at some point, it waits for all of them to fin-ish before proceeding. It can be applied ina recursive way, iterating the forking of newtasks until the problem to solve becomes sim-ple enough. Recursive fork/join effectivelycorresponds to the parallel Divide & Conquerpattern.

input set

output set

parallel map apply function in parallel

Parallel Map. A map operation consists in ap-plying the same function to a set of input items.A parallel map performs all those applicationsin parallel. This is an example of a simple andwidely used parallel pattern. Note that a farmcan implement the parallel function application.

inputs to combine

result

parallel reduce

combination

binary

operation

Parallel Reduce. A reduce operation (alsonamed fold, accumulate or aggregate, among othernames) consists in combining a set of input itemsusing a given binary operation. If the operation isassociative, the reduce computation can be donein parallel by dividing it in stages of independentbinary applications. Typical reduce operationsare aggregate calculations (e.g., sum, average) ona set of numbers. A variant of reduce, scan, produces also all the intermediateresults.

3.1. PATTERN-BASED PARALLEL COMPUTING 43

item 1

item 2

item 3

item 4

worker 1

worker 2

input

shared work pile

workers

Work pile. A workpile consists of ashared pool of work items and a farm ofseveral workers consuming them. Eachworker consumes the first available itemand proceeds on to the next one until nomore work items are available – finish-ing the computation. New work itemsmay be provided either by the externalenvironment or as a one or more resultsof a worker’s execution. The resulting computation is not regular and may di-verge depending on workers’ behaviors.

broadcast

a a

a

scatter

a

b c

gathera

bc

Collective operations. This pat-tern was introduced in the MPI [30]framework. The gather operation in-volves one computing node collectingdifferent information pieces from sev-eral other nodes in order to performa computation. Conversely, a scatter operation is the distribution of a set ofinformation pieces among several receiver nodes. Finally, the broadcast consistsin relaying the same item to several receivers.

1. start search

tasks

3. terminate

other tasks

2. post

solution

Eureka

computation

Eurekas. An eureka computation [34] isany kind of computation in which a solutionspace is searched by several workers orches-trated by a parent task. When any workerfinds a viable solution as defined by the prob-lem nature, an eureka event is triggered: theparent task terminates the now-redundanttasks and proceeds with the solution posted at the eureka event. This pat-tern is usually found in search and optimization procedures.

1 2 3

1

2

3

ai+122 = f(ai

12, ai21, ai

23, ai32)

Stencils. Commonly found in computer simula-tions of scientific and engineering applications, theseiterative computations update a 2D or 3D array of el-ements according to some fixed pattern which is calleda stencil, usually a variation of the Von Neumann andMoore neighbourhoods.

Using skeletons in development. The list of patterns above depicts thebroad spectrum of choices available to a developer. There are several existingparallel programming frameworks providing these algorithmic skeletons in someform. It is not always obvious which of these patterns is best for a problem athand. This thesis intends to contribute to simplifying this task by providing anew pattern combining other related patterns, with emphasis on a declarativeand concise syntax and a formal foundation with clear semantics.


3.2. FXML Parallel SpecificationsAs an introduction, we informally provide FXML syntax and semantics as the

formal tool for the contributions of this thesis. The reader is referred to Ap-pendix A and to [6, 54] for an in-depth and formal definition.

Syntax. An FXML specification describes parallel computations by defining theexpected behavior of any valid implementation of it as a set of partial orders.The body of an FXML specification is composed of blocks called pnodes. The basicpnode-types are variable (var) and function (fun) declarations, assignments, andbasic code. Basic pnodes are executed atomically. Pnodes can be combined withsequential execution constructs: seq, while, for, if-then-else, and with parallelexecution constructs: par (parallel code blocks) and forall (parallel for loop).Pnodes can be labeled. Pnodes inside loops (for, while, forall) are automaticallyand dynamically indexed.

Parallelism can be restricted by specifying data dependencies. The statementdep Q(i) → P(i) specifies a data dependency between occurrences of assignmentslabelled Q and P, meaning that the i-th occurrence of P must use the value ofthe variable, say x, written by the i-th occurrence of Q. We call this an (i, i)dependency. FXML supports dependencies of the form (i, g(i)), where g is an affinefunction. Besides, FXML provides some predefined types of dependencies: weak,i.e., the read value could be any written one, strong, i.e., every written valuemust be read at least once, and bijective, i.e., every written value must be readexactly once.

Data and control dependencies determine a partial order of the executionof statements. The semantics of a pnode is a (possibly infinite) set of (possiblyinfinite) partial orders, called executions, consistent with the conjunction ofconstraints imposed by dependencies.

Semantics. FXML semantics describes the full history of assignments. This isachieved by keeping track of all values carried out by a variable through dynamicand automatic indexing of each assignment. This property is leveraged into asyntactic mechanism by enabling FXML pnodes to refer to specific indexes of avariable.

Given a computable function g, the operation x[g] on variable x, refers tothe value xi+g(i) assigned to x by the assignment indexed i + g(i), where i isthe dynamic index given by the semantics to the inner-most pnode where theexpression x[g] appears. This allows stream programming operations look-aheadand look-behind, to be used on FXML variables. For example, x[-1] (resp., x[1])references the value of x at the previous (resp., next) index. Whenever needed,we will use the syntax x[0] to make it clear we are specifically referring to thevalue xi where i is the index of the current context, as opposed to the completehistory of values.

The behavior of FXML variables is further explained in section 8.1 along withthe implementation of look-ahead and look-behind.

3.2. FXML PARALLEL SPECIFICATIONS 45

Example 1 (Fibonacci Primes). Figure 3.1 shows an FXML specification of aprogram that counts the number of primes among the first N Fibonacci numbers.Program indentation is only for pretty printing.

1 fun fib(f, i) = if i < 2 then 1 else f[-1] + f[-2]

2 fun divisors(j) = if j==0 then 2 else 3+2*(j-1)

3 fun sqrt, not_divides, count, and

4 dep P(i) -> Q(i)

5 dep D(i,j) -> B(i,j)

6 dep B(i,j) -> C(i)

7 dep C(i) -> R

8 var p, c, r

9 par

10 forall (i = 0; i < N+1; i++)

11 P: p = fib(p, i) // allFibs producer

12 forall p

13 par

14 Q: forall (j=0; j <= (sqrt(p)-1)/2; j++)

15 D: d = divisors(j) // divisors producer

16 forall d

17 B: b = not_divides(d, p)

18 C: c = and(b)

19 R: r = count(c)

Figure 3.1: Fibonacci primes counter in FXML.

Figure 3.2 depicts the schematic diagram of its semantics. Notice that onlyone partial order, actually the less restrictive one, is shown. Arrows model dataand sequential control dependencies. Ocurrences of pnodes are indexed. Forinstance, P i represents the ocurrence of the i-th assignment to variable p, orequivalently, the value pi. Indeed, indexes are vectors whose dimension increasesalong with loop nesting.

The dependency B(i,j) → C(i) entails the i-th evaluation of (basic function)and depends on all values of variable b with index (i, j), that is bi,j, where j ∈ Ji,and Ji = [0 . . . (sqrt(pi)-1)/2]. Assuming and computes the conjunction of allthese values, the value ci is

∧j∈Ji

bi,j.Similarly, the dependency C(i) → R entails the evaluation of (basic function)

count depends on all values ci, where i ∈ [0 . . . N). Assuming count computes thenumber of all these values which are true, the value of this occurrence of r is∑i∈[0...N) if c

i then 1 else 0.


Q0

B0,K0

FN−2

R

FN

CN

FN−1

...BN,2

...DN,2

C0 ...

...D0,2 D0,K0

...B0,2

...

F 0

DN,K0

BN,KN

... QN

Figure 3.2: Diagram of Fibonacci primes counter semantics.

Part II

Design of a ConcurrentPattern

47

Chapter 4

TheProduce–Consume–ReducePattern

In this chapter we introduce the first contribution of this Thesis: the PCR

parallel pattern, first as an informal concept and later as a formal definitionusing the FXML specification language and semantics as support.

4.1. Informal presentationThe PCR pattern aims at expressing computations consisting of a producer

consuming input data items and generating, for each one of them, a set of out-puts which is consumed by several consumers working in parallel. Their outputsare finally aggregated back into a single result by a reducer. The goal of thepattern is to emphasize the independence between the different computationsin order to expose all parallelization opportunities.

Producer Reducer


Consumer kConsumer kConsumer k


Figure 4.1: The PCR pattern.

Topology. Figure 4.1 depicts the general form of a PCR. Arrows represent dataconnections in a PCR . Full ones model the possibly multiple input sources andthe single output channel to the external environment. Dashed arrows denote

49

50 CHAPTER 4. THE PRODUCE–CONSUME–REDUCE PATTERN

internal data channels. Notice that all PCR external inputs are available to anyinner component. Data cycles between internal components are not allowed1:the network is itself a directed acyclic graph (DAG) of which any topologicalsorting has the producer and the reducer as as the first and last items respec-tively.

Data flow. Information flow inside a PCR is as follows. For each input dataitem, the producer component generates a set of output values; each one beingimmediately available for reading. Consumer components read values from theouter scope and from the private data channels to perform their computations.At the end, a reducer component combines values from one or more data sourcescoming from the producer and one or more consumers, generating a single outputitem for every external input item processed by the producer. Reads in datachannels are non destructive; the same value can be read by any consumer andby the reducer. No input is ignored by any component – all dashed arrows inthe graph carry the same number of data items to be read.

Concurrency. Producer, consumers, and reducer work in parallel subject tothe existing data dependencies: all input items must be available for a pro-ducer, consumer or reducer instance in order to perform its calculation. Eachproducer, consumer and reducer can potentially spawn as many parallel exe-cution instances as necessary for any specific workload. Both the nature of anexecution instance (local and/or remote thread or process) and the schedulingpolicy are defined by each PCR underlying implementation.

Relationships with other parallel patterns. PCRs combine several parallelprogramming concepts.

Each consumer can spawn as many instances as needed to handle the givenworkload, behaving as as farm.

Putting several consumers in sequence conforms a parallel pipeline witheach consumer as a worker.

Both the PCR as a whole and each internal consumer as a unit can beregarded as a parallel map operation transforming the input stream.

Depending on which implementation is chosen and on the nature of thecomputation, the reducer could perform a parallel reduce operation.

The sequence of first the producer output triggering the spawning of con-sumer instances and then the aggregation of their outputs by the reducercan be regarded as an instance of the Fork/Join model, having the reduceras the synchronization point.

The reducer could finish its computation before all its input items areprocessed or even before the producer or the consumer instances finishtheir work; enabling for eureka computations.

1Cyclic composition through recursion is duscussed in chapter 5

4.2. PCR FORMAL DEFINITION 51

The distribution of the producer output to instances of the same consumeris a scatter operation; availability of the same produced item to all con-sumers is analogous to a broadcast; the combination of all inputs by thereducer is a gather operation.

allFibs countN Fi isPrime(Fi)b primes count

among rst N

PCR for countFibPrimes

isPrimeisPrimeisPrime

divisors anddi

FF

biisPrime(F)

not dividesnot dividesnot_divides

isPrime PCR expansion

Figure 4.2: PCR for counting Fibonacci primes, depicting a nesting opportunity

Example 2 (Fibonacci primes). We illustrate the PCR concept by specifying aprogram that counts primes among the first N Fibonacci numbers. Figure 4.2(top) shows the PCR countFibPrimes. The producer allFibs generates the sequenceF1, F2, ..., FN of Fibonacci numbers. Each instance i ∈ [1 . . . N ] of the isPrime

consumer checks, in parallel, the primality of Fi, resulting in the unorderedoutput of indexed boolean values isPrime(Fi). The reducer count counts the num-ber of those which are true. Figure 4.2 (bottom) shows the PCR of consumerisPrime which checks in parallel all possible di divisors. The and reducer com-putes the conjunction of all the bi outputs by the parallel instances of consumernot_divides. This is an example of a consumer reading the producer output andthe PCR input as well. The ability of nesting PCRs allows reusing components andcontrolling the desired grain of parallelism in a simple way.

The PCR countFibPrimes admits parallel execution at several levels. First, manyinstances of isPrime could be executed simultaneously as allowed by the availableprocessing engines and the Fi production rate. Second, since the count reduceoperation is associative and commutative, it could also be parallelized.

It is worth noticing that, even if at PCR scope the producer and reducer compo-nents are single instances, PCR nesting allows for concurrent execution of multipleinstances of the same producer/reducer pair. In this example, there are as manylogical instances of the divisors producer and of the and reducer as the numberof Fi to be processed by consumer isPrime in the outer scope.

4.2. PCR Formal DefinitionTo give both a specification language and a formal semantics to the PCR

concept, we choose the FXML language and propose syntactic extensions to it inorder to ease writing PCR instances.

4.2.1. SyntaxThe syntax of PCRs is defined in Table 4.1. var is a variable name and param

is a formal parameter. We refer as basic functions to user provided functions


implemented in the host language. For the sake of simplicity, we restrict thegrammar to always include a producer and reducer. In chapter 6 we will relaxthis requirement in special cases where we omit the producer/reducer pair.

〈PCR〉 ::= 〈PCR-name〉 ( 〈param-list〉 ) 〈body〉〈body〉 ::= par 〈producer〉

{forall p {par 〈cons-list〉1 } }〈reducer〉

〈producer〉 ::= p = produce 〈f-name〉〈var-list〉〈cons-list〉i ::= 〈consumer〉i 〈cons-list〉i+1 | ε〈consumer〉j ::= cj = consume 〈f-name〉〈var-list〉 |

cj = iterate 〈cnd〉〈f-name〉〈var-list〉〈reducer〉 ::= r = reduce〈cnd〉〈f-name〉〈init〉〈v-list〉

〈param-list〉 ::= 〈param〉 , 〈param-list〉 | 〈param〉〈v-list〉 ::= 〈var〉〈v-list〉 | 〈var〉〈f-name〉 ::= 〈PCR-name〉 | 〈basic-fun-name〉〈init〉 ::= 〈basic-fun-name〉〈param-list〉

Table 4.1: PCR grammar.

Example 3 (PCR syntax). To illustrate the PCR syntax, we describe in Figure 4.3the example from Figure 4.2 in textual form, reusing some elements from thepure FXML example from Figure 3.1. We omit the par keyword inside forall ifthere is only one children. Notice that basic functions and and count have beenreplaced by calling lower-level functions && and sum as arguments of reduce. Thisenables taking care of the potential parallelism at this level (see Remark in sub-section 4.2.2). The role of functions bndf will become clear later.

1 fun sum(i,b) = i + (if b then 1 else 0)

2 fun \boundfib(x) = x+1

3

4 PCR countFibPrimes(N):

5 par

6 p = produce fib N // allFibs producer

7 forall p

8 c = consume isPrime p


1 fun bounddivisors = sqrt

2

3 PCR isPrime(F):

4 par

5 p = produce divisors F

6 forall p

7 c = consume not_divides p F

8 r = reduce && true c

Figure 4.3: Fibonacci primes counter written in PCR syntax.


4.2.2. SemanticsIn order to define the formal semantics of PCRs, we start by providing the

corresponding FXML specification of each building block: produce, consume, iterate,and reduce, for the basic case where f is a basic function (Table 4.2). In sec-tion 4.2.2 we complete the semantics taking into account nesting of PCRs.

PCR FXML

p = produce f x 1 forall(i=0;i<bndf (x);++i)2 p = f(x, p, i)

cj = consume fj x p c1..ck 1 cj = fj(x, p, c1..ck)

cj = iterate cnd fj z

1 seq

2 y = z3 repeat y = fj(y)4 until cnd(y)5 cj = ylast

r = reduce cnd⊕ v0 z1..zq

1 seq

2 v = v03 for p4 v = v[-1]⊕ 〈z1..zq〉5 if cnd(v) then break

6 r = vlast

Table 4.2: PCR building blocks and their FXML specification.

Producer

f

produce f x1,...,xn

A producer generates the set of indexedvalues to be processed by consumer and re-ducer elements. Formally, it is a forall-pnode,say P , that iterates a basic function f . LetI be the index dynamically assigned by theunderlying FXML-semantics to a particular execution of P , denoted P I . In P I ,bndf determines the number of parallel instances of f as a function of inputvariables x1..xn. For each instance i ∈ [0, . . . , bndf (x1..xn)), the producer writesits output variable p, setting its (I ◦ i)-th value pI◦i, that is, the value of indexi produced by the I-th instance of producer P . This indexing allows for theconcurrent execution of any two instances I 6= J of the producer, each one gen-erating its own set of p values, namely pI◦i and pJ◦j . To compute pI◦i, f canuse any value of the input variables x1..xn, any previous value pI◦i′ of p, i′ < i,and i. That is, a producer can look-ahead/behind at will on input variables andlook-behind on its own output. We omit these dependencies in Table 4.2.

Consumer


f

consume f x1,...,xn

A basic PCR-consumer reads a set of inputparameters and applies a basic function onthem in order to compute a single output.Formally, a consumer is an assignment-pnode,say Cj , whose left-hand side is its output vari-able cj , and its right-hand side is function fj , possibly depending on PCR inputvariables x1..xn, producer’s output p, and other consumers’ output variablesc1..ck, k < j. We restrict the j-th consumer to only read outputs of previousconsumers to avoid data-dependency loops inside a PCR. However, it is allowedto look-ahead/behind on any of them. Associated with the surrounding forall p,there is an (i, i) dependency pI◦i → cI◦ij , which is omitted in Table 4.2. Thus,for each value pI◦i written by the producer, there is an instance CI◦ij writingvalue cI◦ij .

Iterator

f

iterate cnd f x

This construct is a special kind of consumerthat enables looping a PCR until a condition isreached. In each iteration, it can look-behind touse previous values. Like produce, iterate is re-stricted to look-behind operations, as look-aheadwould generate a deadlock. Setting cnd to y[0]== y[-1] allows computing a fixpoint. ylast means the last value of y is read.

Reducer

f

reduce cnd f init x

The reducer uses a commutativeand associative operation ⊕ and aninitial neutral value v0 to combineconsumers’ outputs into a single re-sult, until cnd holds. This condition allows specifying eureka computations [34]making possible early termination. This is a common pattern in search andoptimization problems in which a solution space is searched in parallel until thefirst (or best) solution is found. Setting cnd to false corresponds to reducing allvalues. Hereinafter, we assume that this is the default condition for the reducerand omit it in this case.

A reducer is modeled as a for-pnode that reads the variables z1..zq. Thedependencies are zI◦ij → vI◦i, where vI◦i is the i-th value of the assignmentv = . . . inside the for-loop of the I-th instance of the reducer, and zI◦ij is the i-thvalue assignmed to zj in the PCR. The value vI0 represents the result of evaluatingthe initializer function. The reducer computes vI = vI0 ⊕ZI◦0 · · · ⊕ZI◦b

I whereZI◦i = 〈zI◦i1 ..zI◦iq 〉 and assigns v to r, therefore obtaining its I-th value rI . Here,bI is the minimum between the number of iterations of producer P I (there areas many iterations as values of p written by the I-th instance of the producer)and the index of the first iteration in which cnd(v) becomes true.

The sequential reducer is a special case of the generic specification shown inFigure 4.4, where K is a partition of a given set J indexing the set of values toreduce. Indexes k ∈ K are processed in parallel and sequentially reduced. Thesequential reducer is a single-partition implementation of this general model.


1 // parallel partition

2 forall k ∈ K3 rk = v04 // combine values

5 for i ∈ k6 rk = rk[-1]⊕ S7 if cnd(rk) then break

1 // combine partitions

2 r = v03 for k ∈ K4 r = r[-1]⊕ rlastk

5 if cnd(r) then break

Figure 4.4: Generic FXML specification of reduce.

PCR nesting

Nesting of PCRs. PCRs do not support recursion. Therefore, using PCR B in thedefinition of PCR A is semantically equivalent to inlining the definition of B insideA, renaming all local variables in B as fresh variables, and renaming the formalparameters of B as the variables referenced in the usage of A. In section 5.1 wediscuss recursive calling of PCRs.

Example. Figure 4.5 shows the result of inlining the body of the isPrime PCR

inside the definition of countFibPrimes (inner par block).

1 par

2 F = produce fib N

3 forall F

4 par

5 V = produce divisors F

6 forall V

7 D = consume not_divides V F

8 P = reduce && true D

9 R = reduce sum 0 P

Figure 4.5: Flattened PCR code for countFibPrimes from Example 3

Remark. It is worth making two observations. First, nesting entails the innerPCR inputs are references to outputs from the outside scope. Variables of theouter-scope have an index with lower dimension than the reading pnode. There-fore, the value is obtained truncating the index of the reading pnode to thedimension of the read variable. Second, besides the (i, i) dependencies enforcedso far, look-ahead/behind operations introduce their own data constraints. Sincethese are data dependencies, they are automatically accounted for by the un-derlying FXML semantics [54]. The way indexes are handled in these situationsis revisited in detail in chapter 7 where the proposed implementation of PCRs isexplained. In particular, see section 8.1 for a discussion about this.

Several use-cases of PCRs are discussed in chapter 6. They illustrate the applica-tion of PCRs for describing parallel computations in different contexts.


4.2.3. PCRs as mathematical functions.Property 1. The result of evaluating a PCR in FXML semantics is a functionon the input parameters assuming a) the basic producer, consumer and reducerfunctions are total, and b) no data cycles are introduced by look-ahead in basicfunctions.

Proof sketch. Showing that a PCR implements a total function involves prov-ing that any assignment done to FXML variables as the result of a produce, consumeor reduce execution is defined as some function on the original x1...xn inputvalues. The proof is divided in an outer inductive argument on the numberof nested PCR consumers, and for its base case a second induction on the num-ber of basic consumers. The following assumptions and lemmas are used: (a)the producer, consumer and reducer basic operations are total functions them-selves; and (b) nested PCR consumers behave as total functions (for the inductivestep of the first part of the proof). Proven this, it follows transitively that thevalue assigned in r = reduce f v0 x1..xnpc1..ck is, then, a function on the inputparameters, making the PCR a total function.

Chapter 5

PCR Extensions

In this chapter we introduce several extensions to the basic PCR model definedso far. Their use is illustrated in chapter 6.

5.1. Recursive parallelismProperty 1 shows that any PCR behaves like a total function. This allows

calling a PCR from any basic function in a blocking way, where the caller holdsuntil the call returns. Of course, even if the caller is blocked, the parallelisminside the callee is preserved. Calling a PCR as a function enables recursiveparallelism. A simple example is shown in Figure 5.1. It counts the Fibonaccinumbers which are deletable primes, that is, those that have the property thatdeleting digits one at a time in some order gives a prime at each step.

1 fun isPrimeAndDeletablePrime(x) = if isPrime(x)

2 then isDeletablePrime(x) else false

3 fun delete(x,i) = ... //remove i-th digit of x

1 PCR countFibDeletablePrimes(N):

2 par

3 p = produce fib N

4 forall p

5 c = consume

6 isPrimeAndDeletablePrime p


1 PCR isDeletablePrime(X):

2 par

3 p = produce delete X

4 forall p

5 c = consume isPrimeAndDeletablePrime p

6 r = reduce or false c

Figure 5.1: Example of calling PCRs inside basic functions

Notice that the the recursion presented above is not enabled by PCR inliningbut through the host language native recursion mechanism. This means thatthe recursion depth is unbounded and no infinite PCR inlining occurs.

Remark. It is worth making two observations. First, it is possible to let callerand callee proceed in parallel by not blocking the former until it needs thelatter’s output (like futures [8]), eventually upon return at the latest (like Cilk

57

58 CHAPTER 5. PCR EXTENSIONS

spawning [10]). Second, full FXML allows modelling asynchronous function calls.In the context of this thesis we consider the synchronous function call model forPCR evaluation from the host language. Nevertheless, in chapter 11 we discussthe challenges for specification and implementation of asynchronous extensionsfor PCRs in host language code.

Divide and conquer. The most prominent example of recursive parallelismis divide and conquer, a very well known algorithmic technique consisting inpartitioning a complex instance of a problem into several smaller subproblems,solving each one independently, and combining their solutions in order to cal-culate the final result. Each subproblem can be solved directly if it is simpleenough; otherwise, divide and conquer can be recursively applied.

A common way to describe a problem’s divide and conquer solution is todefine some variant of the following functions:

is_base(x), a predicate checking if problem x is a base case;

base(x), computing the solution for base case x;

divide(x), partitioning problem x into a set of subproblems; and

conquer(x,ca,cb), describing how to combine solutions ca and cb with par-ent problem x as context; this assumes that the N solutions c1, . . . , cN ofthe divide(x) subproblems can be combined by conquer(x, cN (..., conquer(x,c1, null))) ...)1.

Figure 5.2 shows a PCR-based parallel solution. The producer partitions theoriginal problem into subproblems using the iter_divide function. Dependingon the result of is_base, consumers process each subproblem either using base orrecursively calling PCR divide_and_conquer. The reducer uses conquer to combine allthe subproblems’ solutions. null is the empty subproblem. Function terminate

is used to define an eureka stopping condition.Usage of the divide and conquer pattern is illustrated with the N-Queens

problem in section 6.4.

1 fun divide, is_base, base, conquer, terminate

2 fun subproblem(x) =

3 if is_base(x) then base(x) else divide_and_conquer(x)

4 fun iter_divide(x,i) = divide(x)[i]

5

6 PCR divide_and_conquer(x):

7 par

8 p = produce iter_divide x

9 forall p

10 c = consume subproblem p

11 r = reduce terminate conquer null x c

Figure 5.2: PCR definition for divide and conquer.

1We denote null as the empty subproblem and conquer is assumed to be well defined if anyof its parameters is null, i.e. for single or no subproblem scenarios.

5.2. PCR NETWORKS 59

5.2. PCR networksPCRs enforce an (i, i) dependency between input and output. In other words,

for each input value there is always exactly one output. In some scenarios it isuseful to relax that dependency in order to forward more (or less) elements froma source PCR to one or more targets. Some examples of connections between twoPCRs A and B are:

grouping A outputs in variable-sized buckets to be processed by B;

partitioning each output of A in a variable number of items read by B;

monitoring some outputs of A with B without changing A’s behavior.

In all these use cases, enforcing the (i, i) dependency is either incompatible orcostly in practical terms. To solve this, we propose a mechanism of connectingtwo PCRs as follows. The connect(in, out, d) operation bridges the output of PCRA written to variable in to PCR B reading from variable out through a delegatefunction d(in, out): for each output value v written to in, d will be called.Execution of d could ignore v based on some condition or forward fk(v), forsome function fk, to PCRs B by the assignment out = fk(v).

Figure 5.3 shows a connect example along with its execution semantics.



3 Delegate(in, out) makeBucket =

lambda size:



6

7 par

8 forall x1:9 P1: y1 = preProcess(x1)10

11 forall x2:12 P2: y2 = doStatistics(x2)13

14 connect(y1, x2, makeBucket(3))

P 1 P 2P 0 P 5P 3 P 4

S1S0

D0 D1

Figure 5.3: (Left) Connection of preProcess and doStatistics by means of delegatemakeBucket. (Right) One possible partial order.

Delegate function makeBucket executes for each input y1 from PCR preProcess

and can forward zero or more outputs to the doStatistics PCR. The diagramdepicts one possible execution. Labels P , B and S represent instances ofpreProcess, makeBucket, and doStatistics, respectively.

Semantics of connect. Adding an intermediate and opaque delegate allows forbreaking the (i, i) dependency. The resulting network of PCRs is no longer a PCR.Indeed, adding connect extends the model into a two level hierarchy: a level of


PCRs with (i, i) internal dependencies, and a second level of connections havingdependencies between indexed sets of potentially different sizes as derived fromthe behavior of the participating delegates. The way to model this in FXML is touse weak dependencies in the connection between the source and target PCRs.

In this thesis we give a global FXML semantics of a set of connected PCRs asfollows. Let P1, . . . , Pk be a set of PCR single input parameter definitions and letD1, . . . , Dk be a set of delegates.

Given the following set of k PCRs put in parallel in a FXML specification S:1 par

2 forall x1:3 P1: y1 = P1(x1)4 . . .5 forall xk:6 PK: yk = Pk(xk)

the semantics of connect(yi, xj , Dj(in, out)) is given by the following FXML

fragment with the in and out syntactically replaced by yi and xj respectively:1 dep Pi -> Dj

2 dep bijective Dj -> Pj

3 forall yi:4 //...inline Dj code

5 Dj: xj = fj(yi)6 //...inline Dj code

In it, forall yi makes each write to yi spawn a new delegate execution. The del-egate code controls the number of writes to xj in each execution. Dependency Pi

-> Dj is weak and does not add any extra dependencies to those already imposedby the delegate code. Dependency Dj bijective -> Pj states that each write to xjmade by the delegate will spawn one execution of Pj but not neccesarily follow-ing the (i, i) correspondence. The dynamic single assignment (DSA) FXML ruleforbids any xj to be the target of two or more different connect instances. TheDSA rule also implies that if every xj is the target of a connect, the specificationwill model the empty computation, as there can be no external entry point forvalues to process in the PCR network.

Figure 5.4 shows the expanded FXML for the example from Figure 5.3. De-pendency Pi(i*size...(i+1)*size-1) -> Dj is derived from the FXML semantics andthe delegate code and is written for clarity.



3 Delegate(in, out) makeBucket =

4 lambda size:



7

8 dep P1(i*size...(i+1)*size-1) -> D1(i)

9 dep bijective D1 -> P2

1 par

2 forall x1:3 P1: y1 = preProcess(x1)4 forall x2:5 P2: y2 = doStatistics(x2)6 // inline makeBucket(3)

7 forall y1:8 if (mod (idx y1) 3 == 0) then

9 D1: x2 = bucket(y1[0]..y1[2])

Figure 5.4: Complete FXML specification of Figure 5.3 with its connect expansion.

5.3. FEEDBACK LOOP COMPUTATIONS 61

5.3. Feedback Loop ComputationsSome computations involve producing, for each input item, one or more

results, each of which being either output or feedback into the same compo-nent. Programming languages for data-parallel streaming applications, likeStreamIt [50], provide explicit constructs for specifying feedback loops. In thecontext of task-based parallelism, this behavior corresponds to the workpile pat-tern, where an instance of a task can generate more instances and add them toa pile of tasks to be done [41]. To model such behaviors, we extend PCRs with:

o = feedbackloop f v

where o is the output variable, f is a function and v is a value. Table 5.1 sketchesthe big-step operational semantics of feedbackloop. The notation A ` p ⇓ A′means that executing program p in a set of indexed assignments A yields the setof indexed assignments A′. We use E to denote the set of indexed assignmentsof the “external” variables, while X denotes the indexed assignments of thevariable x, which is “internal” to the feedbackloop. We use the notation 〈E ,X〉to make explicit the separation between external and internal variables. Rule[Fdb] states that the initial value x0 of x is v. That is, X starts being {v}.Whenever all values of x have been consumed, i.e., the set X is empty, thefeedbackloop terminates, yielding the set of assignments E ′, which extends Ewith the indexed values of variable o. Rule [DoWork] gives the semantics ofthe actual work done by feedbackloop. It states that for each value xi ∈ X , thereis an instance f i such that f i(xi) performs a set of assignments Oi and Xi onvariables o and x, respectively. The assignments to o are visible “outside” thefeedbackloop, so they are added to E . The assignments Xi on x “spawn” furtherexecutions of f , so they are added to X , while xi is removed since it has alreadybeen consumed. The ] operation ensures the attribution of appropriate indexesto the added assignments.

Fdb〈E , {v}〉 ` o = feedbackloop f v ⇓ 〈E ′, ∅〉

E ` o = feedbackloop f v ⇓ E ′

DoWorkxi ∈ X f i(xi) = Oi ∪Xi

〈E ,X〉 ` o = feedbackloop f v ⇓ 〈E ]Oi,X \ xi ]Xi〉

Table 5.1: Semantics of feedbackloop.

A major difference with the forall construct of FXML is that x is assignedinside the forall body, therefore the total number of instances of x is not knownin advance and the overall structure of the computation is not regular. Besides,feedbackloop is different from iterate in two aspects: 1) each instance may gen-erate an output, and 2) instances can be executed as soon as their dependencieshold.

Indeed, the feedbackloop construct entails a proper extension of the basic PCR

model as it somehow combines producer and consumer capabilities: it consumes


each instance of xi of x and produces a set of indexed of outputs oj0 . . . ojmi andof new instances xk◦0 . . . xk◦ni of x to be consumed. Nevertheless, it can becomposed with a reducer to get a PCR as follows:

1 PCR P(v):

2 par

3 o = feedbackloop f v

4 r = reduce cnd ⊕ r0 o

The use of this pattern is illustrated with the N-Queens case study in chap-ter 6.

5.4. PCR Execution HintsIn order to generate efficient implementations of PCR instances, extra infor-

mation about runtime properties of the designed PCR components is needed. As afirst approach, in this thesis we define several PCR-level performance hints whichcan be provided by the PCR specification writer in order to affect code generationand the runtime behavior of the synthesized implementation.

Let P be a PCR c be any of its producer, consumer and reducer componentswith input parameters p1 . . . pk. We define the runtime information tuple Rc =〈Dc, Lc, Gc〉 where

Dc : [1 . . . k]→ Index→P(Index) is a dependency information mapping,specifying, for each input parameter pj , FXML index I and component c, theset Dc(j)(I) = {J ∈ Index | execution cJ reads value pIj};

Lc : Index → N is a location mapping, giving, for each indexed execu-tion cI , a natural number Lc(I) encoding the suggested implementationdependant execution location; and

Gc ∈ N is a grain size, suggesting that Gc executions of cI should begrouped together to improve preformance.

The information in Rc can be partially or totally provided by the specifi-cation writer in order to allow the execution runtime to apply improvementswhich will depend on the capabilities of the chosen concrete concurrency modelimplementing the PCR instance. From this information, we discuss PCR propertiesthat can be derived at runtime for each PCR component execution cI , and theirapplicability to performance improvements.

Memory management. The dependency information mapping provided byDc ∈ Rc allows calculating reads(cI) ∈ N as the expected number of reads ofcI ’s output variable carried out by any other PCR component d. Let readers(c)be the set of tuples 〈d, i〉, where i is the position of c’s output variable in thelist of input parameters of d. Then,

reads(cI) =∑

〈d,i〉∈readers(c)

Dd(i)(I).

This enables memory usage optimizations by means of garbage collection ofoutput items that are no longer needed in an ongoing execution after the globalexpected number of reads is reached.

5.4. PCR EXECUTION HINTS 63

Distribution of output items. In distributed execution environments, anoptimal communication scheme between computing nodes is key in order toachieve performance. In the context of PCRs, the history of all writes to its FXML

variables is a global state which has to be stored in an implementation definedway across all nodes of a distributed computing scenario. For a given componentc, the combination of the dependency information and location mappings (Dc

and Lc) can be used to calculate the following item usage location mappings:consumer_locs(cI) = {Ld(I) | 〈d, ·〉 ∈ readers(c)} ⊂ N is the set ofdistributed locations reading the output of cI ; and

producer_locs(cI , j) = {Ld(I) | 〈c, j〉 ∈ readers(d)} ⊂ N is the set oflocations producing the inputs of the j-th input parameter of cI .

These two mappings can help define communication scheduling at implementa-tion level in order to send items only to the nodes actually consuming them.

Parallelism granularity. Choosing the correct granularity for a parallel com-putation is necessary in order to obtain a balance between the level of achievableparallelism and the synchronization, communication and task creation over-heads. The grouping size Gc can be used to define the chunk size to use in theparallel computation of component c. For instance, providing a value Gc for areducer component can enable the partitioning of the reduction of n items inton/G parallel reductions of consecutive items.

Example 4 (execution hints). The following FXML code example shows a parallelspecification of isprime with execution hints.

1D0 = lambda p : return (lambda I: {I})

2L0 = lambda I : return I % NLOCS

3 PCR isPrime(N):

4 par

5 〈D0, L = lambda I: return 0, G = 1〉6 p = produce divisors N

7 forall p

8 〈D0, L0, G = 100〉9 c = consume divides p N

10 〈D0, L = lambda I: return NLOCS-1, G = 1000〉11 r = reduce or false c

In it, D0 is a dependency mapping stating that every index of each inputparameter is read exactly once, and L0 is a location mapping implementing asimple round-robin assignment of indexes to available locations. In this examplewe reuse D0 for the producer, consumer and reducer, as all their user functionsread their input exactly once. For the producer and reducer, the same locationfor all executions is suggested by providing a constant value location mapping.The consumer is distributed using the L0 location mapping. Finally, differentgrain sizes for each component are specified, suggesting that no producer execu-tions should be grouped (G = 1) and that consumer executions should be groupedin smaller (G = 100) batches than reducer executions (G = 1000), as the rel-ative cost of the consumer is higher (division checks in potentially big integerscompared to boolean or operations).

Chapter 6

Case studies

In this chapter we illustrate how PCRs, together with the extensions discussedso far, allow expressing commonly used parallel programming patterns [41] andease comparison of different approaches to parallelize specific problems.

6.1. Low Pass FilterThis example is a pipeline with three computing stages: low pass filter (LPF),

demodulator (DEMOD), and equalizer (EQU) [50]. An interesting aspect is thatpipeline stages consume a sliding window of consecutive elements in order togenerate their output: LPF and EQU consume NUM_TAPS consecutive input elementsin order to produce a single output element, while DEMOD uses two consecutivevalues in order to generate one output.

The PCR shown in Figure 6.1 has no producer or reducer. This is equivalentto having a producer/reducer pair with the identity operation.

1 PCR lowPassFilter(S):

2 par

3 c1 = consume LPF S

4 c2 = consume DEMOD c1

5 c3 = consume EQU c2

6

7 fun LPF(v)

8 sum = 0.0

9 // consume a window of

10 // width NUM_TAPS from v

11 for i in 0..NUM_TAPS:

12 sum += v[i] * COEFF[i]

13 return sum

x0

demod1

x2x1 x4x3 x5

lpf 3lpf 1 lpf 2lpf 0

equ0

demod2

y0

demod0

Figure 6.1: Left: PCR “Low Pass Filter” with its consumer LPF. Right: Dependencydiagram for NUM_T AP S = 3.

A typical implementation would interconnect stages with buffers. Instead,the PCR-based description abstracts out any buffering scheme between stages

65

66 CHAPTER 6. CASE STUDIES

using look-ahead inside the consumers to describe this behavior (e.g., LPF). Be-sides, this example involves two parallel programming patterns: a) it showcasesan instance of a stateless parallel pipeline composed of consume components, andb) each consume can be regarded as a one dimensional stencil computation, aslook-ahead is used to access several neighbor values of the input.

6.2. Backwards ReachabilityAutomata-theoretic verification of safety properties [15] can be viewed as

the problem of checking whether a set S of error states is reachable from a setI of initial states. Figure 6.2 describes the backwards reachability algorithmfor solving this problem. It uses iterate to iterate an inner PCR Bckwrds whichperforms one-step backwards reachability. For each state s in StSet, it computesits predecessors by calling PRED, which in turn could itself be a PCR. C is thestopping condition. It can be defined so as to terminate whenever a fixpoint iscomputed or an initial state is reached, in which case an eureka reducer can beused in PCR Bckwrds.

1 // state functions

2 fun states, sunion

3

4 // PCR entry point

5 PCR BackReach(S):

6 StSet = iterate Bckwrds S C

1 // one step of backwards reach

2 PCR Bckwrds(StSet):

3 par

4 s = produce states StSet

5 forall s

6 p = consume PRED s StSet

7 r = reduce sunion StSet p

Figure 6.2: PCR “Backwards Reachability”

The backwards reachability problem can be also modelled with a feedbackloopbackward searching for reachable initial states starting from the set of finalstates, combined with an eureka reducer to stop as soon as a reachable initialstate is found.

6.3. Count WordsGiven a text T and a set W of words, count-words computes the number of

appearances of each w ∈W in T . This is a typical MapReduce example [19].Figure 6.3 shows two PCRs for counting words. PCR count-words-by-lines splits

T in lines and counts the appearances of words in W for each line in paral-lel, using basic function count. PCR count-words-by-words adds an extra level ofparallelization, by calling, for each word w ∈W , PCR count-words-by-lines.

6.4. N-QUEENS PROBLEM 67

1 fun lines, words, joinCounts,

2 count, elem

3

4 PCR count-words-by-lines(T, W):

5 par

6 l = produce lines T

7 forall l

8 c = consume count W l

9 r = reduce joinCounts {} c

1 PCR count-words-by-words(T, W):

2 par

3 w = produce elem W

4 forall w

5 c = consume count-words-by-lines

6 T [w]

7 r = reduce joinCounts {} c

Figure 6.3: Count-words.

6.4. N-Queens Problem

8 0Z0Z0L0Z7 Z0ZQZ0Z06 0Z0Z0ZQZ5 L0Z0Z0Z04 0Z0Z0Z0L3 ZQZ0Z0Z02 0Z0ZQZ0Z1 Z0L0Z0Z0

a b c d e f g hAn 8-queens solution

This problem consists in placing N queensin an N ×N chessboard, with no two queenssharing the same row, column or diagonal. Itillustrates the technique of backtracking, ascandidate solutions are built by placing onequeen in each row of the board until a) Nqueens were successfully placed (a correct so-lution was found), or b) no more queens canbe added without conflicts, or c) all boardconfigurations have been tried. In either case,a backtracking step is performed to try otherpossible configurations by reverting the lastqueen’s positioning and trying the next pos-sible square for the last row. This can be par-allelized by visiting the backtracking branches in parallel and combining partialresults.

1 fun is_base(c) = c.complete() or not c.canAddQueens()

2 fun base(c) = if c.complete() then [c] else []

3 fun is_big(c) = c.currentRow < MAXDEPTH

4 fun conquer(s1, s2) = s1 ++ s2

5 fun divide(c) =

6 cs = []

7 for i in 1..c.boardSize

8 if c.canAddQueen(i) then cs += [c.addQueen(i)]

9 return cs

Figure 6.4: N-Queens using PCR-based divide and conquer.

Figure 6.4 shows a PCR for finding all solutions using divide and conquer : c

is a configuration of the chessboard with the placed queens, and cs is a list ofconfigurations. Function complete, checks whether a configuration is a solution,canAddQueens, checks whether it is possible to place a queen, etc. The is_big

predicate determines if a subproblem is large enough to warrant solving it in


parallel: MAXDEPTH is the maximum recursion depth done in parallel beforestarting to work sequentially.

A variation consists in finding only one solution. This is done by appropri-ately defining terminate in the divide_and_conquer PCR of Figure 5.2 to stop therecursion as soon as a valid solution is found. Figure 6.5 shows an alterna-tive specification using feedbackloop composed with an eureka reducer. Functioninitial(N) constructs a start configuration of size N, and found(cs) verifies whethercs contains a complete configuration in order to stop the computation.

1 fun initial, found

2

3 fun nqueens(c,cs) =

4 for i in 1..c.boardSize

5 if c.canAddQueen(i) then

6 z = c.addQueen(i)

7 if z.complete() then cs = [z]

8 else c = z

1 PCR NQueens(N):

2 par

3 c = produce initial N

4 cs = consume NQueens_fb c

5

6 PCR NQueens_fb(c):

7 par

8 cs = feedbackloop nqueens c

9 r = reduce found ++ [] cs

Figure 6.5: N-Queens with feedbackloop.

Finally, Figure 6.6 shows a PCR using iterate. This specification restrictsparallelism with respect to the feedbackloop version as each iteration subtaskhas to finish before a new N-Queens iteration can begin, with the found reduceracting as a synchronization point.

1 fun extend(c) = if c.complete() then [c] else divide(c)

2

3 PCR nqueens_step(cs):

4 par

5 c = produce elem cs

6 forall c

7 cs = consume extend c

8 r = reduce found ++ [] cs

9

10 PCR nqueens_iterative(cs):

11 c = iterate found nqueens_step cs

Figure 6.6: N-Queens with iterate.

6.5. ID3 ALGORITHM 69

6.5. ID3 Algorithm

Weather

Parents

Visiting

Parents

Visiting

Parents

Visiting

Money

TennisCinema

Shopping

Stay in

sunny

windy

rainy

richpoor

noyes

yes no

noyes

Cinema

Cinema

Cinema

A decision tree

Decision tree learning [4] is a machinelearning method which consists in buildingand using a decision tree for classifying in-puts from a dataset. Each input consists ofa valuation for each of the attributes of in-terest in the model. Internal nodes representdecision points based on the value of a singleinput attribute; having one branch for each ofthe disjunct cases considered for the attributevalue. Leaf nodes are labeled with a class rep-resenting the result of the decision process.

ID3 [44] is an algorithm for building a de-cision tree based on an training dataset with classification examples. Figure 6.7describes a recursive version and its main dataset partitioning function.

1: function build_tree(S, A) . S: input dataset; A: remaining attributes2: if |A| = ∅ then return Leaf(mode(S))3: end if4: ga ← gain(S, a) for each attribute a ∈ A5: amax ← a ∈ A with maximum ga . max information gain6: P ← partition(S, amax)7: for each vi value of amax do8: if P [vi] = ∅ then Ti ← Leaf(mode(S))9: else Ti ← build_tree(P [vi],A \ amax)10: end if11: end for12: return Node(amax, (v1, T1)..(vk, Tk))13: end function14:15: function partition(S, a) . partition dataset S by value of field a16: P ← {}17: for each s ∈ S do18: append(P [s[a]], s)19: end for20: return P21: end function

Figure 6.7: ID3 Algorithm

In each iteration, an attribute a ∈ A is chosen such that a partitions theexample set with minimum entropy. Then, for each value vi of a in S, the algo-rithm recursively generates a sub-tree using a as root, {e ∈ examples / e[a] =vi} as example subset and A− {a} as candidate attribute set.

We identify two classes of independent parallelization opportunities in thisalgorithm. First, the parallel divide and conquer pattern (section 5.1) can beused to concurrently execute the independent subtree building tasks. Second,bulk data parallel operations like value frequency calculations, entropy, and setpartition can be naturally expressed as PCRs. Figure 6.8 shows PCRs for theseoperations.


1 fun chunks(S,i):

2 return S[i.. i+CHUNK_SIZE]

3

4 fun group_by(item,attr,map):

5 return map[item[attr]] += item

6

7 PCR frequencies(S, attr, cls):

8 par

9 p = produce chunks S

10 forall p:

11 c = consume frequency S attr cls

12 r = reduce union {} c

1 PCR entropy(S, attr, cls):

2 par

3 p = consume frequencies S attr cls

4 c = consume seq_entropy p S

5

6 PCR choose_best_attribute(S, attrs, cls):

7 par

8 p = produce iterate attrs

9 forall p:

10 c = consume entropy S p cls

11 r = reduce minimize c a

Figure 6.8: PCR data parallel operations for the ID3 algorithm.

6.6. Sentiment AnalysisSentiment analysis [42] is the practice of combining text and natural lan-

guage analysis techniques in order to measure subjective qualities: if a givenwritten user review is positive or negative, which emotion conveys a forum com-ment (anger, joy...), and similar properties. Typically, it requires processinglarge quantities of input data, as comments of millions of persons are used toinfer a general opinion on a specific matter, for instance in social networks, hotelbooking sites, etc.

Here we study an example inspired by [28] in which a pipeline of opera-tions on tweets is presented as part of the presentation of the Google CloudDataflow product. Figure 6.9 (right) depicts a scenario in which tweets re-garding a football match are read from a cloud publisher/subscriber endpoint(InputFromPubSub). The TweetTransformer component applies several transforma-tions on them, followed by two different analyses on the transformed tweets,of which we focus on the CalculateSentiment component – extracting the tweets’sentiments and grouping them in buckets of N-minute intervals. Note that theTweetTransformer operations can be applied in parallel to each incoming tweet.

Figure 6.9 (left) shows a PCR specification. The task InputFromPubSub is im-plemented in the environment, providing input tweets to the TweetTransformer

PCR. Parallelism inside the tweet pipeline is specified by PCRs and the communi-cation between TweetTransformer and the two different analyses is performed byconnect constructs (see section 5.2), as the target iteration spaces are of differentsizes. Work made by ExtractSentiment and SlidingWindow is combined into theBucketDelegate component, which transforms from the TweetTransformer iterationspace into the smaller bucketed iteration space consumed by MeanSentiment.

6.6. SENTIMENT ANALYSIS 71

1 // buckets and forwards to CalculateSentiment

2 delegate Buckety = lambda size: lambda t: ...

3 // forwards all to CorrelateKeywords

4 delegate Idy(t): y = t

5 PCR CorrelateKeywords

6

7 PCR TweetTransformer(t)

8 par

9 d = consume Deserialize t

10 tr = consume Translate d

11 tt = consume TagTeam tr

12 ss = consume ScoreSentiment tt

13 ek = consume ExtractKeywords ss

14 kt = consume KeyByTeam ek

15

16 PCR MeanSentiment(bucket)

17 par

18 m = consume Mean bucket

19 o = consume OutputMeanToBigQuery m

20

21 par

22 forall x1:23 P1: y1 = TweetTransformer(x1);24

25 forall x2:26 P2: y2 = MeanSentiment(x2);27

28 forall x3:29 P3: y3 = CorrelateKeywords(x3);30

31 connect(y1, x2, Bucket(size))

32

33 connect(y1, x3, Id)

InputFromPubSub

Deserialize

Translate

TagTeam

ScoreSentiment

ExtractKeywords

KeyByTeam

ExtractSentiment

SlidingWindow

Mean

OutputMeanToBigQuery

TweetTransformer

CalculateSentiment

CorrelateKeywords

Figure 6.9: Sentiment analysis pipeline (right) and its PCR (left)

Part III

Implementation of PCRs

73

Chapter 7

A Concrete ExecutionModel for PCRs

In this chapter we present Concurrent Collections [12]; a concrete targetmodel for implementing PCRs.

7.1. The Concurrent Collections ModelBasic Elements. A CnC program consists of a high level description of a com-putation graph, legacy code to be executed, and the environment. The basicatoms are computation steps hosting legacy code to be run; control instancesor tags, where each tag value represents the signal of one unit of work to bedone by each dependent step; and data items, which are read and written bythe steps during their computation work.

Relationships. The CnC graph is constructed by interconnecting collections ofthe introduced elements. Each tag collection prescribes a number of computa-tion steps. Tag values can be put into tag collections by the environment or bysteps; in the latter case each step controls the tag collection. Posting a new tagvalue will spawn one instance of each controlled step collection with the tag valueas a parameter. Each item collection is a concurrent map storing items indexedby tag values; providing get and put operations. Items are get/put by the en-vironment or by steps; in the latter case the steps consume-from/produce-to theitem collection. To ensure determinism, the semantics prohibit overwriting oftags or items in any collection (Dynamic Single Assignment). Figure 7.1 showsan instance of a CnC and illustrates the basic building blocks.

Semantics. CnC is described by operational semantics considering a simplifiedcore language called Featherweight-CnC [12], which formalizes the concept oflaunching steps when tags are posted. Memory (the item collections’ contents)is expressed as a flat memory array; it is assumed that each item collectioncontents are mapped univocally to data memory locations. Tag collections arerepresented by step-spawning rewrite rules in the operational semantics.

75

76 CHAPTER 7. A CONCRETE EXECUTION MODEL FOR PCRS

t0

s0 s1

i1

i2

Environment

posts tags

into t0 and

data into i0

Environment

consumes

output from i2t1i0

Figure 7.1: A diagram of a CnC graph. Diamonds, ovals and boxes respectivelyrepresent tag, step and item collections. Dotted lines indicate prescribes (control)

relationships between tag and step collections and arrows denoteproduce/consume/control relationships between step, item and tag collections.

We briefly describe the relevant Featherweight-CnC syntax elements. CnC com-putation steps are written as functions with a single tag input parameter. Execu-tion of a step S is triggered by the prescribe S operation. Memory is representedby the data global object with get(i) for reading the value stored in memorylocation i and put(i,v) for writing value v into memory location i. The CnC

semantics requirement of Dynamic Single Assignment means there can be atmost one data.put(i,v) execution for each possible value of tag i.

Example 5 (Featherweight-CnC specification). The following code shows apipeline of two consumer steps calculating f ◦ g to each input.

1 consumerf (t) {

2 x = data.get(t)3 y = f(x)4 data.put(t+ 1,y)5 prescribe consumerg(t+ 1)6 }

1 consumerg(t) {

2 y = data.get(t)3 z = g(y)4 data.put(t+ 1,z)5 }

In this code, for each value x and corresponding tag value t provided bythe environment, consumerf will read x from the data global memory at posi-tion t. Then, it will calculate a result and store it in location t + 1. In turn,consumerg will read location t+ 1 and calculate the final result at location t+ 2.Ensuring uniqueness of writes to data memory locations is the responsibility ofthe Featherweight-CnC specification writer in order to maintain a deterministiccomputation. For this specific example, the environment should provide, for in-stance, pairs xi, 3ti such that consumerf and consumerg will store the result ofapplying f and g in the 3t + 1 and 3t + 2 tag values respectively. In that way,there is a unique memory location for each written value.

7.2. Translating PCR specifications Into CnCHaving set CnC as a target computation model for PCRs, we define a translation

of a PCR into Featherweight-CnC. Afterwards, we show that the semantics of theCnC code is functionally equivalent to the FXML semantics.

CnC memory model. As CnC computation steps are stateless, the only memoryused by a CnC program is its item collections’ storage. In its operational seman-tics definition, CnC simplifies the formal memory model by mapping all itemcollections into a single flat data structure. In this translation, we assume this

7.2. TRANSLATING PCR SPECIFICATIONS INTO CNC 77

memory flattening mapping as given but keeping track of the memory mappedto each item collection. To achieve this, we denote, for the item collectionassociated with FXML variable x, datax as the memory reserved to it; thereforedatax.get(t) denotes the value stored in data for tag value t in the item collectionassociated to x. Likewise, datax.put(t,v) denotes a write operation into the itemcollection associated to x of value v for tag value t.

FXML variables and indexes. In CnC, item collections represent the assignmenthistory for FXML variables. The index of each assignment is the tag value usedas key in the collection. As FXML indexes may be multidimensional because ofiteration space nesting, the tag type is a vector of integer values. When readinga FXML variable indexed with a lower dimension, the implementation truncatesthe reader tag dimension to the read variable dimension. In this way, the getoperation done in the source item collection is always well defined.

Translation of look-ahead/look-behind. In order to translate these oper-ations on FXML variables into CnC we need to rewrite the basic f functions usedin the produce, consume and reduce primitives. Having a function f(s1, .., sk) = Ewith input FXML variables si written as an expression E, and an index I, wedefine the translation operator f I as

f(s1, .., sk)I≡ E[si[d] := datasi

.get(I + d)]i∈1..k

where E[x := y] denotes syntactic substitution of x by y in expression E.Note. Without loss of generality we assume that every variable si appears in Ein the form si[d] for some integer d, and that d is a value and not an expression.

Producer. Given a producer function f(x1..xn, p, i) and a bound expressionbndf (x1..xn), the CnC translation of p = produce f x1..xn is defined as

1 producef (I) {

2 for (i=0; i < bndf (x1..xn)I; ++i)

3 {

4 datap.put(I ◦ i,f(x1..xn, p, i)I◦i

)

5 PCR_prescribe consume1...k(I ◦ i)6 prescribe reduce(I ◦ i)7 }

8 prescribe reduce-end(I ◦ i) }

produce

inputs

p

consume 1..k

reduce

reduce-end

Note that the producer assignment increments the index dimension. Thisproducer translation prescribes a special reducer step reduce-end with a tag valueencoding the total number of items to process.

For prescribing the consumers, we define the macro PCR_prescribe C whichtranslates into prescribe C if C is a basic function, and into prescribe produceC

if C is a nested PCR, where produceC is the corresponding produce step for theCnC translation of PCR C.

Consumer. Given a basic function f(x1..xn, p, c1..ck), the CnC translation ofcj = consume f x1..xn, p, c1..ck is


1 basic_consumef (I) { datacj.put(I, f(x1..xn, p, c1..ck)

I) }

consumeinputs c

If the consumer is a PCR P , the translation does not generate a CnC step:it recursively expands the definition into the corresponding CnC translations ofproducer, consumer and reducer of P . The nested producer step does the get

operations for the input parameters, and the reducer step does the put operationon its output item collection which corresponds to variable cj .

Reducer. The CnC implementation of a PCR reducer r = reduce cnd f v0 z1..zq,with f(v, z1..zq), is defined as

1 reducef (I ◦ i) {

2 if (i==0) u = v03 else u = datav.get(I ◦ (i− 1))4 u = f(u, z1..zq)

I◦i

5 if cnd(u) datar.put(I, u)

6 else datav.put(I ◦ i, u)

7 }

8 reduce-endf (I ◦ i) {

9 datar.put(I, datav.get(I ◦ (i− 1))) }

reduce

inputs

v

rreduce-

end

The reducer folds the nested iteration space (with tags I◦0 . . . I◦(k − 1)) intothat of the outside scope (with tag I). Each reduce step executes one operationreading the output of the previous step stored in collection v, or using theinitial value v0 (for the first reducer). u is a local variable of the step in thehost language. After checking cnd, the executing step either posts the result tothe final output r or to v. reduce-end is eventually prescribed by the producerand forwards the last value from v to the output r. Exactly one of the put

operations of lines 5 and 9 is executed. If the cnd operation is omitted, theconstant function false is assumed (i.e. no input will produce an eureka event).

Iterate. The implementation of cj = iterate cnd f z is a direct translation ofthe pseudocode in Table 4.2.

1 iteratef (I) {

2 i=0

3 datay.put(I ◦ i, f(z)I)

4 repeat

5 datay.put(I ◦ (i+ 1),f(y)I◦i

);

6 i=i+1;

7 until cnd(y)I◦i

8 datacj.put(I, datay.get(I ◦ i)) }

input

iterate

y

c

Feedback Loop. Given a function f , the CnC translation of o = feedbackloop f vis


1 fdb-startf (I) {

2 i = 0

3 datax.put(I ◦ i, datav.get(I))4 prescribe fdbf (I ◦ i)5 }

6

7 fdbf (I ◦ i) {

8 O,X = f(x)I◦i

9 for u ∈ X10 j = nextindex(datax, I)11 datax.put(I ◦ j, u)

12 prescribe fdbf (I ◦ j)13 for u ∈ O14 j = nextindex(datao, I)15 datao.put(I ◦ j, u)

16 PCR_prescribe consume1...k(I ◦ j)17 }

fdb-start

input

o

consume 1..k

fdb

x

fdb

The CnC code in steps fdbf and fdb-startf use an atomic function nextindex toobtain a fresh index to use in the put operations. Performing k calls to nextindex

will return consecutive indexes from 0 to k − 1 for each combination of inputparameters. Again, u is a local variable of the step in the host language.

Evaluation. In order to implement the evaluation of a PCR P (x1..xn) on aset of input values v1..vn, we provide the following CnC code generated from thedefinition of P , where index I is provided by the environment, and reducer(P )is the output variable of P ’s reducer.

1 evaluateP (I, v1..vn) {

2 datax1.put(I, v1)3 . . .4 dataxn

.put(I, vn)5 PCR_prescribe P (I)6 return datareducer(P ).get(I)7 }

Figure 7.2 shows the CnC graph of PCR countFibPrimes (Figure 4.5). The com-plete translation into CnC is shown in Figure 7.3.

fib divisors

not_divides all count

F ZN

i0

V W

i1i2 D P

countFibPrimes

isPrime

R

Figure 7.2: CnC implementation of nesting of isPrime as consumer in countFibPrimes.


1 // PCR: countFibPrimes(N)

2 // F = produce fib N, bndfib(N) = N

3 producefib(I) {

4 for(i=0; i < dataN.get(I); ++i) {

5 dataF .put(I ◦ i, fib(F, i)I◦i

)

6 PCR_prescribe consumeisPrime(I ◦ i)7 prescribe reducesum(I ◦ i) }

8 prescribe reduce-endsum(I ◦ i)9 }

10

11 // P = consume isPrime F

12 // isPrime is a basic function

13 consumeisPrime(I) {

14 dataP .put(I, isPrime(F )I)

15 }

16

17 // R = reduce sum 0 P

18 // v is a local item collection

19 // u is a local variable

20 reducesum(I ◦ i) {

21 if (i==0) u = 0

22 else u = datav.get(I ◦ (i− 1))23 datav.put(I ◦ i, sum(u, P )

I◦i)

24 }

25

26 reduce-endsum(I ◦ i) {

27 dataR.put(I, datav(I ◦ (i− 1))28 }

29

30 env(I, N) {

31 dataN.put(I, N)

32 PCR_prescribe fibPrimes(I)33 Y = dataR.get(I)34 }

1 // PCR: isPrime(F)

2 // K = produce divisors F

3 // bnddivisors(F) = (sqrt(F)-1)/2

4 producedivisors(I) {

5 for(i=0; i < (sqrt(F )-1)/2I; ++i) {

6 dataK.put(I ◦ i, divisors(i))7 PCR_prescribe consumenot_divides(I ◦ i)8 prescribe reduce&&(I ◦ i)9 }

10 prescribe reduce-end&&(I ◦ i)11 }

12

13 // Basic consumer

14 // P = consume not_divides K F

15 consumenot_divides(I) {

16 dataD.put(I, not_divides(V, F )I)

17 }

18

19 // R = reduce && true D

20 // v is a local item collection

21 // u is a local variable

22 reduce&&(I ◦ i) {

23 if (i==0) u = true

24 else u = datav.get(I ◦ (i− 1))25 datav.put(I, u && D

I◦i)

26 }

27

28 reduce-end&&(I ◦ i) {

29 dataP .put(I, datav(I ◦ (i− 1))30 }

31

32 // Example of fib function rewrite

33 // fib(F,i) = if i<2 then 1 else F[-1]+F[-2]

34 fib(F,i)I

= if (i<2) 1

35 else dataF .get(I − 1) + dataF .get(I − 2)

Figure 7.3: Translated CnC code for the countFibPrimes example


7.2.1. Semantic CorrectnessThe implementation of PCRs by CnC computation graphs preserves the func-

tional semantics of the original PCR specification by calculating the same outputvalue.FXML and Featherweight-CnC semantics. FXML gives semantics to specifica-tions in a denotational way, describing the set of valid executions for any im-plementation of the specification. In contrast, CnC gives small step operationalsemantics to its programs. In order to relate both, we compare their functionalbehavior ; so as to map values assigned to FXML indexed variables into CnC itemcollections of the proposed implementation.

Property 2. Let P be a PCR definition of arity n, v1..vn be FXML variables withI assignments each, v = yI = P (vI1 ..vIn) be the FXML result of evaluating P withvI1 ..v

In as parameters, datav1..vn be a CnC memory region storing the full history

of assignments to v1..vn, and P̂ be the translation of P to CnC. Then, the CnC

execution of P̂ (I) stores in dataIy the value v given by [[P ]].Proof sketch. The proof (in a more detailed form in section B.1) is by struc-tural induction and case analysis on each type of component of a PCR. For eachcase, it shows that the CnC generated code functionally behaves as the FXML oneby proving that there is a CnC memory location storing the reducer output.

Chapter 8

A C++ Template Libraryfor PCRs

As part of this work, a set of C++ templates was developed to provide ahigh-level, implementation-agnostic description of PCRs. The template libraryhandles PCR nesting and integration with basic code. In this chapter we give abrief discussion of the library and the design choices made.

8.1. Implementation Agnostic PCR C++ APIIn C++ , PCRs are specified as compositions of template type definitions. A PCR

is specified as a pcr template instance parameterized with the type list of theTi input types, P producer, Ci consumers and R reducer body elements:

pcr < T_1,...,T_n, Producer, Consumer_1, ..., Consumer_k, Reducer >

Element C++ Specification Notes

T_i pcr_in< type(Xi) > input parameter types

Producer pcr_produce<PROD,parameters...> PROD: basic producer declarationpcr_feedback<FDB,parameter> FDB: basic feedback loop declaration

Consumer pcr_consume<CONS,parameters...> CONS: basic consumer or PCR

pcr_iterate<PCR,parameter> Iteration of nested PCR

Reducer pcr_reduce<RED, parameters...> RED: basic reducer declaration

PROD producer<decltype(fp),fp , tuner> basic producer declarationFDB feedback<decltype(fp),fp , tuner> basic feedback loop declaration

CONSconsumer<decltype(fp),fp , tuner> basic consumer declaration

pcr<...> nested PCR consumer declarationRED reducer<decltype(fp),fp , tuner> basic reducer declaration

Table 8.1: Summary of C++ templates used for PCR specifications.

Table 8.1 (top) summarizes the syntax for specifying Ti, P, C, and R. In allcases, parameters... is a list i1, ..., ik of positive integer constants specifying thesource parameters as relative positions backwards in the PCR body list; i.e. ik

83

84 CHAPTER 8. A C++ TEMPLATE LIBRARY FOR PCRS

means “the output of the ik-th preceding PCR body element”. The special case offeedbackloop takes a single input parameter.

Table 8.1 (bottom) shows the syntax for declaring type elements PROD, FDB,CONS and RED. A function pointer fp to a user function is required in each case.Parameter tuner is a type implementing the optional runtime information tuplefrom section 5.4. Its default type is provided by the framework. Contents ofthe tuner type are detailed later in this section (see Table 8.3).

Figure 8.1 shows a C++ specification of the countFibPrimes PCR from Example 3.Note the positional syntax for input parameter specification in the PCR body.

typedef pcr< // notes on parameters

pcr_in<num_t>, // input N of type num_t

pcr_produce< fibs, 1 >, // 1 is input N

pcr_consume< isPrime, 1 >, // 1 is fibs output

pcr_reduce < count, 1 > // 1 is isPrime output

> countFibPrimes; // countFibPrimes output: result of "count"

Figure 8.1: countFibPrimes written as a C++ template.

Host Language Code Integration. The countFibPrimes example is incom-plete as there is no definition of the components fibs, isPrime and count whichcorrespond to template parameters PROD, CONS and RED from Table 8.1. Figure 8.2extends the example showing code for the fib operation.

num_t fib(pcr_var<num_t>& F, int i) {

if (i<2)

return 1;

else

return F[0] + F[-1];

}

// id = bound_fib

num_t id(pcr_var<num_t>& N) { return N; }

// producer operation type and function pointer

typedef producer<decltype(&fib), &fib> fibs;

Figure 8.2: Host language code and binding for fib producer.

Function parameters. Basic functions in the countFibPrimes example havetheir input parameters decorated with the pcr_var template interface1. Thiswrapper type abstracts FXML variables in host language code, and allows for thelook-ahead and look-behind stream operations described in section 3.2. Table 8.2summarizes the operations implemented by this interface.

1 Decorating the basic functions’ parameters is equivalent to the translation operator EI

presented in section 7.2.

8.1. IMPLEMENTATION AGNOSTIC PCR C++ API 85

Operation Description

operator T() read implicit index valueint idx () return current implicit index

T operator[] (int) look-ahead/look-behind relative to implicit index

Table 8.2: Function parameter pcr_var wrapper operators for basic functions.

Evaluation of PCRs as functions. To enable interaction with the environ-ment, the pcr template implements the function call operator operator() whichpasses the given inputs to the PCR pattern in a non-blocking way, as the expecteduse case is to feed a PCR a stream of values to process. The results are returnedsynchronously providing a C++ PCR iterator interface together with begin and end

methods for the PCR following the C++ standard iterator library conventions.

Connecting PCRs. Section 5.2 describes a general way to connect multiple PCRs

in a tree network. This Thesis provides as a proof-of-concept a way to connecttwo PCRs by means of a delegate and to branch the output of one PCR to the inputchannels of two other PCRs.

template <typename S, typename T, typename Delegate>

struct connect;

template <typename S, typename T1, typename D1, typename T2, typename D2>

struct branch;

Above, S and T denote the source and target PCRs of a connection, while D

denotes the delegate type wrapper for each target. Templates connect and branch

define a type implementing the behavior of the combination of the given PCRs

with a delegate function. Delegate types must implement the following interface.

template <typename PCR_Target, typename input>

struct delegate {

static void apply(pcr_read<input> & IN, PCR_Target & P);

};

The connect type follows the same interface of a PCR: operator() for providinginput and an iterator type for consuming output. The branch type is similar, withthe difference that it provides two different iterator types and access functionsin order to allow consuming the outputs of each of the target PCRs independently.

In the current version of the implementation these connections cannot becomposed together, preventing the formation of true networks of PCRs. In Chap-ter 11 we discuss the technical challenges of providing a complete implementa-tion of the concept.

Figure 8.3 shows an abridged example of branch within the setting of theSentiment Analysis case study in Section 6.6, including two delegate examples.

C++ PCR performance hints. To allow tuning implementation performance asdescribed in section 5.4, the PCR C++ API provides optional template parameters


/// PCRs to connect

typedef pcr<...> TweetTransformer;

typedef pcr<...> MeanSentiment;

typedef pcr<...> CorrelateKeys;

struct bucket {

const int BUCKET_SIZE = 5;

template <typename Target, typename T>

static void apply(pcr_var<T>& in, Target& p) {

Bucket<T> b = makeBucket(in, BUCKET_SIZE);

// Feed bucket b into PCR instance p

p(b);

} };

struct id {

template <typename Target, typename T>

static void apply(pcr_var<std::string>& in, Target& p) {

// forward the input into PCR instance p

p(in[0]);

} };

typedef branch <TweetTransformer, MeanSentiment, bucket,

CorrelateKeys, id> SentimentAnalysisPipeline;

Figure 8.3: Sentiment analysis pipeline code example.

affecting runtime behavior. Table 8.3 describes the C++ interface for providingperformance tuning hints in an implementation-independent way.

The tuple Rc from section 5.4 relates to methods from Table 8.3 as follows.

Information provided by the dependency information mappingD has to beprovided by methods expected_reads, consumed_indexes and consumer_indexes.

The location mapping L corresponds to the output of the locationsmethod.

Grain size G corresponds to GROUP_SIZE.

The minimal set of parameters described in Table 8.3 can be used by thetranslation and runtime. All the parameters are functions of each FXML variableindex in order to allow a different tuning for each item to process.

8.2. Platform Dependant Target Code Genera-tion

Having defined all the PCR interface templates, we applied template metapro-gramming techniques in order to unfold the PCR definitions into a CnC executiongraph.

In the CnC C++ framework, a computation graph is represented by an instanceof the CnC::context class. Every step, item and tag collection has to be boundto the containing context instance.

8.2. PLATFORM DEPENDANT TARGET CODE GENERATION 87

PCR hint Description

expected_reads(N) specify the number of reads of each index of Nth parameterconsumed_indexes(N) specify the set of consumed indexes of N-th parameter for each

consumer indexconsumer_indexes(N) specify the set of consumer indexes of N-th parameter for each

producer indexlocations(nlocs) index to location mappingsize_t GROUP_SIZE group size to use in a reduction or consumer; the number

of parallel reduce chunks for N elements will be ceil(N /GROUP_SIZE)

Table 8.3: PCR optional performance tuning hints

As PCRs are composable by nesting and the CnC context is flat, the imple-mentation generates a CnC::context subclass representing the complete PCR in aflat form. The template expansion rules convert a pcr<X1,..,Xn,P,C1,...,Ck,R>

definition into an class public inheritance chain of the form:cnc(R) : cnc(Ck) : ... : cnc (C1) : cnc(P) : cnc(Xn) : ... : cnc(X1) :

CnC::context.Here, cnc represents the synthesized implementation class for the correspond-

ing PCR component. Table 8.4 describes briefly the generated subcontexts foreach type of PCR basic component.

PCR component Generated CnC context

pcr_in<T> dummy context holding a reference to the actual item col-lection storing T values and its corresponding tag collection

pcr_produce<P,...> one step collection executing the producer; one item collec-tion for output values; one tag collection to prescribe con-sumers and reducers and another for prescribing the last re-ducer step; producer steps are prescribed by the outer scope

pcr_feedback<FDB,...> one step collection for starting the feedback loop and an-other for executing each feedback loop step; one item collec-tion for output values and another for feedback values; onetag collection for self-prescribing feedback steps, one to pre-scribe consumers and reducers and another for prescribingthe last reducer step; feedback steps are prescribed by theouter scope

pcr_consume<C,...> one step collection executing the consumer; one item collec-tion for output; steps prescribed by the producer

pcr_iterate<P,...> two step collections, one for triggering the first iteration andanother for iterating the nested PCR and checking the fin-ish condition; one item collection for intermediate iterationresults and one for output; steps prescribed both by the pro-ducer and self-prescribed

pcr_reduce<R,...> two step collections, one for each reduce step and one forthe last reduce step; one item collection for output; stepsprescribed by the producer

Table 8.4: Overview of CnC implementations of PCR building blocks


CnC implementation of pcr_var. Each FXML variable is mapped into an itemcollection mapping each FXML assignment to the assigned value. Figure 8.4 showsan abridged implementation of the read operations given in Table 8.2. Templatetype tag_t represents a static integer vector whose dimension will depend on thePCR nesting level of a pcr_var instance; operation last returns the index value forthe innermost nesting level. Template type var_t abstracts out the type of thevalue stored by the pcr_var.

template <typename tag_t, typename var_t>

class cnc_pcr_var : pcr_var {

item_collection_t& items;

tag_t current_tag;

operator var_t() { return *this[0]; }

int idx() { return current_tag.last() }

pcr_var& operator[] (int idx) {

var_t out;

items.get(current_tag+idx, out);

return out;

}

}

Figure 8.4: C++ CnC abridged implementation of the pcr_var type.

CnC implementation of PCR evaluation. In order to allow communicationbetween the PCR and its enclosing scope, the translation synthesizes a CnC im-plementation which essentially performs the operations described in section 7.2with the difference that sending input to the PCR is implemented by operator(),and consuming the results is encapsulated in the CnC implementation of theiterator interface described in section 8.1.

CnC performance tuning. The C++ CnC framework supports specifying tunerswhich are optimization hints applied on a CnC graph step, tag, and item collec-tions. Tag tuners allow for partitioning of tag ranges; which enables processingof a group of tag values by the same step instance, improving locality. Tagtuners also enable memoization of tag values. Item tuners let the user specifythe number of expected read operations on each stored item (get_count). Thesealso allow specifying in which process (for distributed scenarios) each item is tobe produced/consumed on. Step tuners let the programmer a) declare data de-pendencies by specifying which tag values a step will consume from which itemcollections in a specific step execution and b) specify the step relative priorityand give hints of thread/process affinity to the scheduler. Worth of mentionis the cancel_tuner letting the programmer signal the early termination of allrunning instances of a step.

Considering the performance hints described in Table 8.3, we analyse theirapplication to the CnC implementation:

expected_reads is used in the consumer step tuner to declare data dependen-cies and in the item tuner of each consumed input to declare the get_count

property for memory usage optimization.


consumed_indexes and consumer_indexes are used together with location toimplement the consumed_on and produced_on CnC distributed tuning hints.

GROUP_SIZE is used by the CnC implementation to handle chunking of itemsto process in consumers and reducers.

Figure 8.5 shows a schematic of the interaction of a CnC context reading theoutput from a source CnC context, and the tuning usage by the latter.

producer context

cnc tuner

producer context PCR tuner

producer context

steps

consumer context

producer context

output item collection

post expected reads

and locations

collect tuning information

set items get_count, produced_on

and consumed_on propertiesset execution location

self-declare

input dependencies

Figure 8.5: PCR tuning information usage in the CnC implementation

Figure 8.6 shows a CnC context using consumed_indexes in order to predeclareits input dependencies. Figure 8.7 show a CnC consumer context using GROUP_SIZE

in order to implement chunking at the infrastructure level. Figure 8.8 shows thecode from a CnC context forwarding consumer_indexes, expected_reads, and locations

to each of the steps producing its input, and Figure 8.9 complements the formershowing how a CnC tuner structure uses its local PCR tuner and the informationprovided by other consumer steps in order to apply tuning.

template <typename ctx, typename tag, typename declarer, typename tuner,

int p, typename P, typename... Params>

struct dependency_declarer<ctx, tag, declarer, tuner, p, tuple<P, Params...> >

{

static void declare(ctx& c, const tag& t, declarer& dC, const tuner & tun) {

tag t0 = t; auto d = tun.consumed_indexes(p);

for (const auto& i : d(t.idx())) {

t0.idx() = i;

dC.depends(P::get(c), t0);

}

dependency_declarer<ctx,tag, declarer, tuner,

p+1, tuple<Params...> >::declare(c,t,dC,tun);

} };

Figure 8.6: CnC context using consumed_indexes to declare the input dependencies forits computing step


struct step {

int execute( const in_tag_type &t, consumer_context & c) const {

if (tuner_t::GROUP_SIZE > 1) {

if ((t.idx() % tuner_t::GROUP_SIZE) == 0) {

in_tag_type t0(t);

t0.idx()-= tuner_t::GROUP_SIZE - 1;

str_step<info>::execute_range(t0,t+1,c);

}

return CnC::CNC_Success;

}

else

return str_step<info>::execute(t,c);

} };

struct last_chunk_step {

int execute( const in_tag_type &t, consumer_context & c) const {

if (tuner_t::GROUP_SIZE > 1) {

if ((t.idx() % tuner_t::GROUP_SIZE) != 1) {

in_tag_type t0(t);

t0.idx() = t.idx() - ((t.idx()-1) % tuner_t::GROUP_SIZE);

str_step<info>::execute_range(t0,t,c);

}

}

return CnC::CNC_Success;

} };

Figure 8.7: Code fragment of a CnC consumer using GROUP_SIZE to implementchunking

template <typename ctx, typename steps, int N, typename tuner,

typename P, typename... Ps>

struct consume_declarer<ctx, steps, N, tuner, tuple<P, Ps...> > {

static void declare(ctx& c, steps& s) {

tuner pcr_tuner;

typedef typename P::getter_type get_t;

typedef typename get_t::source_t src_t;

// declare CnC consume relationship with input item collection

s.consumes(get_t::get(c));

// Access the tuner from the context generating input

auto & input_tuner = static_cast<src_t&>(c).the_item_tuner;

// declare relevant tuning information to input context

input_tuner.readers_information.push_back(

make_info(pcr_tuner.expected_reads(N),

pcr_tuner.locations(distributor::numProcs()),

pcr_tuner.consumer_indexes(N)));

// iterate remaining input parameters

consume_declarer<ctx,steps,N+1, tuner, tuple<Ps...> >::declare(c,s);

} };

Figure 8.8: Code fragment of a PCR context posting its expected_reads,consumer_indexes and locations information to each step producing its input


template <typename local_pcr_tuner, typename ctx, typename tag_t>

struct local_cnc_tuner : public hashmap_tuner,

public cancel_tuner<tag_t> {

// local tuning information

local_pcr_tuner tuner;

// tuning information provided by reader steps

vector<tuning_t> readers_information;

// Tell CnC the expected item lifetimes

template<typename pcr_tag_t>

int get_count(const pcr_tag_t & tag) const {

int count = 0;

// accumulate all readers' expected read counts

for (const tuning_t& d : readers_information)

count += d.expected_reads(tag.idx());

return count > 0 ? count : CnC::NO_GETCOUNT;

}

// Tell CnC the location holding this step's output

template<typename pcr_tag_t>

int produced_on(const pcr_tag_t & tag) const

{ return tuner.location(tag, distributor::numProcs()); }

// Tell CnC which other locations consume this output

temaplate<typename pcr_tag_t>

vector<int> consumed_on(const pcr_tag_t & tag) const {

vector<int> locations;

// Accumulate the location of each reader for the current tag

for (const tuning_t& t : readers_information) {

for (const indexes_t& idx : t.consumer_indexes(tag))

locations.push_back(t.location(idx));

}

return locations;

}

// Tell CnC which location to use for running the current step

template<typename pcr_tag_t, typename Arg>

int compute_on(const pcr_tag_t & tag, Arg& arg) const

{ return tuner.location(tag, distributor::numProcs());}

};

Figure 8.9: Code fragment of a CnC tuner using the performance hints

Chapter 9

Experimental evaluation

In this chapter we evaluate the practical application of PCRs in terms ofperformance. Our goals are (a) to validate the applicability of the CnC codegeneration technique developed for PCRs, (b) to validate the applicability of high-level PCR performance tuning for the CnC implementation, and (c) to illustratePCRs as a tool for comparing different parallel implementations to solve the sameproblem.

To fulfill these ends, we first compare performance of hand-coded CnC im-plementations against PCR specifications in terms of time and space. Second,we evaluate the effect of PCR performance tuning on several multicore and dis-tributed benchmarks. Finally, we compare the runtime performance of differentparallel implementations for the same problem, using PCRs as a developmenttool.

9.1. MethodologyBenchmarking runs were performed on a server with 4 processors (AMD

Opteron (TM) Processor 6276 @ 2.30GHz) of 16 computing cores each (64 totalcores) with 128GB of system memory running the 64-bit version of RedHatLinux Enterprise 6.7. Each run was repeated 10 times and total elapsed runningtime and peak resident memory usage (Resident Set Size or RSS) were recordedas informed by the operating system. For each 10-run set of time an memorymeasurements, the median was calculated. In order to assess variability, therange of values was also calculated. The number of cores to be used in each runwas controlled by setting the CPU affinity mask using the Linux taskset utility.The same CPU mask was used for each k-core run. For each benchmarkedprogram the problem size was fixed (strong scaling), variyng only the numberof cores used.

9.2. Basic performance analysisIn this section we make a preliminary analysis of the current CnC imple-

mentation of our PCR parallel code library. Given that at this stage the CnC

implementation of PCRs is not heavily optimized, the goal is to provide an initial

93

94 CHAPTER 9. EXPERIMENTAL EVALUATION

overview of the running time behavior to assess the feasibility of the approach.For benchmarking analysis we followed the recommendations given in [33].

To elaborate the performance analysis carried out, the following two mainquestions were considered:

Q1 How does PCR generated code compare to hand-coded CnC solutions in termsof running time and memory consumption?.

Q2 How does automatic CnC code generation impact compilation resources andexecutable sizes?.

Question 1. For comparison, we chose two basic CnC programs: blackscholesand producer-consumer. Both are pipelines with one and two consumer stagesrespectively. In these cases virtually equivalent running times and memory usageare expected, as the CnC structure synthesized by the PCR code generation rulesshould closely match the existing hand-coded implementation. Figure 9.1 sum-marizes the results. The 2-letter prefix code in captions denotes each benchmarkas follows: BS for blackscholes and PC for producer-consumer. Each boxplotcompares PCR against CNC in time / memory usage when using 16, 32 and 64cores.

tim

e (

s)

0.5

1

1.5

2

2.5

3

0.5

1

1.5

2

2.5

3


(a) BS time measurements

RSS(KB)

40,000

60,000

80,000

100,000

120,000

140,000

160,000

40,000

60,000

80,000

100,000

120,000

140,000

160,000


(b) BS RSS measurements

tim

e (

s)

4

6

8

10

12

14

4

6

8

10

12

14


(c) PC time measurements

RS

S (

KB

)

0

20,000

40,000

60,000

80,000

100,000

120,000

0

20,000

40,000

60,000

80,000

100,000

120,000


(d) PC RSS measurements

Figure 9.1: PCR vs CNC time and memory comparisons for different numbers ofavailable computing cores.

For blackscholes, execution times are very similar with very low dispersionfor 32 and 64 core executions in both implementations. Memory consumption

9.3. N-QUEENS PERFORMANCE STUDY 95

is also very similar in both implementations showing that memory consumptionhas some dispersion in both implementations independently from the numberof cores.

The same time and memory similarities are observed for producer-consumer.Time dispersion is still noticeable for the 32-core runs and almost neglible in64-core runs. This is consistent with the fact that a longer pipeline will needmore computing cores in order keep items flowing through it at a steady rate.

Question 2. To answer the second question, we compiled different PCR pipelinesof increasing lengths and measured compilation time, compilation memory usageand final executable size. Figure 9.2 shows the results.

YA

xis

Tit

le

200

250

300

350

400

450

500

550

600

650Y

Ax

is T

itle

0

5

10

15

20

25

pipeline length

0 5 10 15 20

compiler RSS (MB)

compile time (s)

(a) Compilation resource costs

ex

ec

uta

ble

siz

e (

KB

)

200

400

600

800

200

400

600

800

pipeline length

0 5 10 15 20

(b) Stripped executable size

Figure 9.2: Compilation time and space costs and excutable sizes for different PCRpipeline lengths

Measurements show a linear increase in both time and space, suggestingthat PCR synthesis cost and impact on binary executable size scale linearly withPCR size, validating the scalability of the approach in terms of code generationresource costs.

Final remarks. In general, benchmarks show that the preliminary PCR imple-mentation has qualitatively similar performance compared to direct CnC imple-mentations. Also, even if executable size grows when writing a PCR, sizes arewithin reasonable bounds for general purpose parallel code. Together, theseresults support the applicability of the CnC code generation tool.

9.3. N-Queens performance studyIn this section we compare the three implementations of N-Queens presented

in section 6.4. We use the eureka versions, finishing the computation as soon asthe first result is found. We set the time limit to one minute.

First, we analyse running times for various recursion depths and problemsizes (Figure 9.3).

The iterate implementation with N=16 shows its best performance withrecursion depth 3 and reaches the time limit for recursion depths higher than4. For N=27, this implementation does not finish within the time limit for anyrecursion depth. The divide and conquer implementation shows much better


N=16

tim

e (

s)

0

10

20

30

40

recursion depth

0 1 2 3 4 5 6


N=27

tim

e (

s)

0

20

40

60

80

recursion depth

0 2 4 6 8 10 12 14

feedbackloop DC

Figure 9.3: Run time comparison between N-Queens PCR implementations withvarying recursion depth for different N .

running times than iterate. We observe that increasing the recursion depthworsens running times, eventually hitting the time limit with recursion depthshigher than 3. This implementation does not scale with increasing problemsizes, as shown in the N=27 graph. The feedbackloop implementation performsbadly for small recursion depths, reaching the time limit. We observe thatincreasing recursion depth improves running times, eventually achieving bettertimes than the other implementations. Figure 9.4 summarizes the best runningtime achieved in each implementation for different problem sizes, showing thatonly feedbackloop scales beyond N=29.

tim

e (

s)

0

20

40

60

NQ board size (N)

15 20 25 30


Figure 9.4: Summary of the best running time in each implementation for N=16..32

As a second performance measure, Figure 9.5 box-plots the running timesof the best PCR implementation we identified so far (based on feedbackloop) com-pared to the hand-coded CnC version which is part of the CnC suite. This im-plementation follows the work-pile pattern. Both implementations use equalrecursion depth. All runs were repeated 20 times and were performed using 16,32 and 64 cores. With 32 cores, variability reduces, but a small advantage isstill apparent in the CnC implementation. Finally, with 64 cores little variabilityis observed; with both implementations having almost equal running times.

9.4. COUNT-WORDS PERFORMANCE STUDY 97

time(s)

0

1

2

3

0

1

2

3

FDB16 CNC16 FDB32 CNC32 FDB64 CNC64

Figure 9.5: Time comparison between the feedbackloop N-Queens implementationand a hand-coded CnC implementation.

9.4. Count-words performance studyIn this section we perform a performance comparison of the two PCR solutions

proposed for the count-words problem in section 6.3. The actual implementa-tions partition the file in chunks of several lines. Runs were repeated 10 timesand the median of the measurements was recorded. An input text file of 111Mlines with a total size of 8GB was used.

tim

e (

s)

25

30

35

cores

16 24 32 48 64

by lines by words CnC by lines tuned

Figure 9.6: Comparison of count-words implementations for increasing numbers ofavailable computing cores.

Figure 9.6 shows running times of PCR-based implementations for counting 9words with a chunk size of 10K lines. It also compares PCR versions against apure CnC implementation of count-words taken from the CnC samples which uses aparallel reduce graph. Globally, count-words-by-lines exhibits better performancethan count-words-by-words, but the gap tends to diminish for larger number ofcores. For more than 32 cores, PCR-based versions show slightly better runningtimes than the CnC one. We also analyzed the effect of providing tuning hints atPCR-level for count-words-by-lines. From its specification on Figure 6.3, it followsthat variables l and c are written as many times as lines are produced, butread exactly once by the consumer and by the reducer, respectively. A simple


tuning action is to specify this fact, with dependencies, enabling the runtime tofree memory as soon as each value is read. Figure 9.6 shows that the tunedversion delivers the best performance.

Clearly, the number of words to count and the chunk size used for parti-tioning the input file affect performance. Figure 9.7 shows running times ofboth non-tuned PCR-based implementations on 64 cores, for counting 100 wordswith increasing chunk sizes. The running time of count-words-by-lines increasessteadily with the chunk size. This is consistent with the fact that it searches eachword in every chunk sequentially, and so bigger chunks reduce the exploitableparallelism. On the other hand, count-words-by-words, which counts each word inparallel, takes advantage of a larger number of words to count while too manysmall chunks affect its performance negatively. The figure shows that its run-ning time goes down up to a chunk size of 100K, without further improvement.For chunk sizes of 10M, count-words-by-lines shows an abrupt decrease in perfor-mance compared to count-words-by-words. This is consistent with the fact thatfor this size there are less chunks than the number of available processors. Inthis scenario count-words-by-words benefits of the extra dimension of parallelism.

tim

e(s

)

50

100

150

chunk size

10k 100k 1M 5M 10M

by linesby words

Figure 9.7: Comparison of two count-words implementations with varying chunksizes.

9.5. PCR and CnC usage in teachingIn this section we report an experience of using PCRs as a framework for teach-

ing pattern-based parallel programming to undergraduate students in ComputerScience. We compare the result of using pure PCRs against using CnC to providea parallel solution to a problem.

Context. During the second semester of 2017 an elective pattern-based par-allel programming course1 was given to undergraduate students. The FXML spec-ification language was taught as a formal way to specify potential parallelism.Besides, PCRs were presented along with the CnC parallel programming model.Course students had no previous experience with either PCRs or CnC, and onlyhad basic concurrent programming knowledge (threading and synchronizationprimitives) from a previous mandatory Operating Systems course.

1Location: Departamento de Computación, FCEyN, UBA.

9.5. PCR AND CNC USAGE IN TEACHING 99

Activity. The final project for the course was to choose a parallel program-ming framework to design a parallel solution for a problem chosen by the stu-dents, and describe the parallel semantics using FXML specifications. Two stu-dents chose CnC and PCR to parallelize the problem of counting prime numbers ina given interval. They implemented the Miller-Rabin prime checking algorithmin its deterministic variant, using the Sieve of Eratosthenes algorithm as an ac-celerator: having a set of all primes from 2 to some constant K, every numberin the interval K − K2 can be checked for primality by only searching for adivisor in the known primes set.

Students chose to perform primality checks in parallel for each number inthe interval to check, while keeping the Miller-Rabin algorithm itself sequential.

Parallel Implementations overview. Figure Figure 9.8(a) shows the CnC

graph of the implementation written by the students. The main function in-stantiates this context and fills the small primes item collection with an Er-atosthenes Sieve containing a fixed number of primes between 2 and K for arange of N numbers to check. Then, it creates tags in the ranges tag collectionwith all partitions of the complete number range to count in. These partitionsare processed by step Find Primes which marks as composite any multiple ofthe primes in small primes. Unmarked numbers, if less than K2, are deemedas prime and stored in the primes item collection. Remaining potential primesare put in the possible primes tag collection to be processed in the Rabin Millerstep. This second step executes the Rabin Miller algorithm on the given poten-tial prime, and, if the test is passed, the number is stored in primes. Essentially,there is one instance of Find Primes per range and one instance of Rabin-Millerper each potential prime not screened by the Sieve.

Find Primes

Rabin Miller

small primes

primespossible

primes

ranges

from

environment

to environment

(a) CnC parallelization

Ranges

Sieve

DeterminePrimality

Count

producer

consumers

reducer

(b) PCR parallelization

Figure 9.8: Graphical representations for CnC and PCR parallel CountPrimes

Figure Figure 9.8(b) shows the PCR model for the same algorithm. The small_-

primes collection is kept as a global vector and is filled in the main function.Then, the PCR is instantiated and is given a single number range to process. TheRanges producer generates partitions which are read by the Sieve consumer,performing the same work as Find Primes in the CnC implementation, returninga vector with a set set of potential primes for the partition. This set is readby consumer Determine Primality which calculates the count of actual primes,


invoking the Rabin Miller algorithm to verify numbers bigger than K2. Thefinal Count reducer combines these totals and returns the final number of primesfound. This implementation has a slightly smaller degree of parallelism, as allthe potential primes in the same partition are checked sequentially with theRabin-Miller algorithm.

Code analysis. We performed a preliminary qualitative analysis of the student-written code in two basic characteristics: (a) code separation from business logicand parallel infrastructure; and (b) complexity of code devoted to describingavailable parallelism.

Figure 9.9 (left) shows the code for the Find Primes computing step. Notethat user code has to be aware of the existence and names of the context itemand tag collections, and has to handle them manually in order to preserve theintended global flow of execution. Figure 9.9 (right) shows the user functionexecuted by the Sieve consumer. Note that it resembles a traditional function,with the exception that the input parameter is decorated with the pcr_read

template. After reading the input numPair into variable np, the rest of the codeproceeds as an ordinary function.

1 int FindPrimes::execute(range r, my_ctx & c)

2 const {

3 vector<bool> is_prime((r.to-r.from)/2, true);

4 for (auto cii = c.small_primes.begin();

5 cii != c.small_primes.end(); cii++) {

6 int prime = cii -> first;

7 ll first_multiple =

8 firstMultiple(r.from, prime);

9 if (first_multiple % 2 == 0)

10 first_multiple += prime;

11 for (ll i = first_multiple;

12 i < r.to; i += prime * 2)

13 is_prime[(i-r.from-1)/2] = false;

14 }

15 for (int i=0; i<is_prime.size(); i++) {

16 ll num = r.from + i * 2 + 1;

17 if (is_prime[i] && num > 1) {

18 if (num <= MAX_CRIBA * MAX_CRIBA)

19 c.m_primes.put(num, num);

20 else

21 c.possible_primes.put(num);

22 }

23 return CnC::CNC_Success;

24 }

1 vector<long long> _sieve(pcr_read<numPair>& m) {

2 numPair np = m;

3 range r; r.from = np.first; r.to = np.second;

4 long long count = 0;

5 vector<bool> is_prime((r.to-r.from)/2, true);

6 for (auto cii = small_primes.begin();

7 cii != small_primes.end(); cii++) {

8 int prime = *cii;

9 long long first_multiple =

10 firstMultiple(r.from, prime);

11 if (first_multiple % 2 == 0)

12 first_multiple += prime;

13 for (long long i = first_multiple; i<r.to;

14 i += prime * 2)

15 is_prime[(i-r.from-1)/2] = false;

16 }

17 vector <long long> ve;

18 for (int i=0; i<is_prime.size(); i++) {

19 long long num = r.from + i * 2 + 1;

20 if (is_prime[i] && num > 1)

21 ve.push_back(num);

22 }

23 return ve;

24 }

Figure 9.9: CnC and PCR user code integration compared.

Figure 9.10 compares the CnC (left) and PCR (right) parts of the implementa-tions defining the CnC graph and the PCR instance, as depicted in Figure 9.8. TheCnC code defines the context class and the programmer manually declares connec-tions between components using the produces, consumes and prescribes methods.In this code we found that the students missed one relationship declaration: thefind_primes step collection should control the possible_primes tag collection as thestep code shown in Figure 9.9 (left) posts new tags into it. The PCR description isshorter. Lines 1-5 pre-declare the producer, consumers and reducer to be used,and lines 7-13 declare the PCR itself.

9.5. PCR AND CNC USAGE IN TEACHING 101

1 typedef long long ll;

2 struct my_ctx : public context<my_context> {

3 step_collection< FindPrimes > find_primes;

4 step_collection< RabinMiller > rabin_miller;

5 tag_collection< range > m_ranges;

6 tag_collection< ll > possible_primes;

7 item_collection< ll, ll > m_primes;

8 item_collection< int, int > small_primes;

9

10 my_ctx() : context< my_context >(),

11 find_primes(*this), m_tags(*this),

12 m_primes(*this), small_primes(*this),

13 rabin_miller(*this), possible_primes(*this) {

14 m_tags.prescribes(find_primes, *this);

15 possible_primes.prescribes(rabin_miller,*this);

16 find_primes.consumes(small_primes);

17 find_primes.produces(m_primes);

18 rabin_miller.produces(m_primes);

19 }

20 };

1 typedef producer<decltype(&ranges), &ranges> Ranges;

2 typedef consumer<decltype(&_sieve), &_sieve> Sieve;

3 typedef consumer<decltype(&_determine), &_determine>

4 DeterminePrimality;

5 typedef reducer<decltype(&_count), &_count> Count;

6

7 typedef pcr<

8 pcr_in<numPair>,

9 pcr_produce<Ranges, 1>,

10 pcr_consume<Sieve, 1>,

11 pcr_consume<DeterminePrimality, 1>,

12 pcr_reduce<Count, 1>

13 > countprimes;

Figure 9.10: CnC and PCR pure infrastructure code compared.

Run-time behavior analysis. Figure 9.11 plots the scalability of both im-plementations for number ranges of increasing size. In Figure 9.11(a), bothimplementations are shown to perform identically as input size increases. How-ever, Figure 9.11(b) shows that space consumption in the CnC implementationclimbs much quicker than in the PCR version. This is explained by the extradegree of parallelism in the CnC version. As it checks each prime candidate inparallel after the sieve pre-screening, a bigger input interval will result in manymore Rabin Miller computation steps to spawn. Memory consumption in thePCR version will grow proportionally to the number of partitions to check, as noindividual primes are checked in parallel.

800

Interval size

1,000M

(a) Running time comparison

2,500

Interval size

1,000M

(b) Memory usage comparison

Figure 9.11: Runtime comparison of the CnC and PCR implementations.

Students’ feedback. After the experience and implementations’ performancereview, we discussed with the students their development experience. Compar-ing both frameworks, they found PCRs easier to understand and write, evenstating that “It’s just like writing plain functions”. They found the CnC graphdevelopment more confusing and error prone.


Experience conclusions. This preliminary experience shows promising re-sults and suggests PCRs can be used to teach parallel programming. We think itwould be valuable to perform a study with several students in order to assessthe educational capabilities of PCRs.

Part IV

Discussion

103

Chapter 10

Related work

PCRs are related to both algorithmic skeleton frameworks [27] and streamprocessing models [48]. Following the conclusions summarized in [49], we ana-lyze the capabilities of current streaming and skeleton frameworks with respectto: a) exposing task, data, and pipeline parallelism; b) exposing the presenceof sliding window parallelism; c) preventing the usage of stateful filters; d) nat-urally describing complex computation topologies; and e) keeping the parallelproblem description platform-agnostic.

Hereinafter, we refer to the atomic components of a framework as Single/-Multiple Input (SI/MI) or Single/Multiple Output (SO/MO) with respect to thenumber of input/output channels a component can have.

PCRs cover the mentioned requirements by a) execution semantics allowing forconcurrent instances of the same consumer and chaining of consumers, togetherwith FXML dependencies describing data parallelism; b) providing look-ahead andlook-behind operations on PCR variables in order to enable expression of slidingwindow and stencil computations; c) use of pure functions in basic code; d)supporting nested PCRs with named parameters for the producer, consumer andreducer inputs (MISO), allowing for complex connections of PCRs while keepinga functional behavior; and e) separating PCR specifications from the executionplatform.

Algorithmic skeleton frameworks. We limit the discussion to those whichare more closely related to PCRs.

Quaff [24] declares a concurrent computation coordination topology by thecomposition of basic skeletons. It uses legacy code functions restricted to a singleinput parameter (SISO). Another limitation is that the farm construct requiresfixing at compile time the number of processors to be used at runtime. In [23],a CSP [32] semantics of Quaff is provided without showing the implementationcorrectness with respect to this semantics.

Muesli [14] supports task parallelism by constructing a topology of connectedprocesses and data parallelism by using distributed data structures. The topol-ogy is constructed by composing object instances (dynamic polymorphism) ofbasic skeletons (Pipe, Filter, Farm) and algorithmic ones (DivideAndConquer,BranchAndbound). All communication in data parallel structures is explicit sothe problem description has embedded communication logic. No formal modelis provided.

105

106 CHAPTER 10. RELATED WORK

MapReduce [19] defines a simple computation model together with a ref-erence implementation handling work distribution. An extension [21] of theMapReduce implementation allows iterative computations, similar to PCRs iterate

construct. PCRs extend MapReduce with the concepts of composition by nesting,iteration, producer components, and “native” look-ahead/look-behind.

The Orleans Skeleton Library [36] follows the Bulk Synchronous Parallel(BSP) [51] computation model with SISO basic atoms. Its semantics is for-malized by term rewriting but no formal relationship with the actual executionmodel (MPI) is provided.

SkePU [22] is a skeleton library targeting multi-core and GPU parallelism.It supports MISO computations but limits the number of input parameters to3. No formal semantics of the skeleton execution is given.

STAPL [57] focuses on compositional reuse of skeletons and provides adomain-specific syntax for describing complex component interconnections. Itsupports iteration but it lacks formal semantics.

Summarizing, to the best of our knowledge, most algorithmic skeleton ap-proaches are restricted to SISO components, lack formal semantics, neither sup-port iteration nor eureka computations, and do not provide specific support forlook-ahead and look-behind stream operations.

Stream programming models. StreamIT [50] is a programming languagebased on pipelines of filters. Each filter has one input and one output streams.Communication is achieved by push, pop and peek (look-ahead) operations onstreams. Complex topologies can be assembled using FeedbackLoop (iteration)and SplitJoin connectors which first separate and later combine items from/toone stream to/from many. The main differences with PCRs are: a) destructivepop operations on streams preventing filters from using the full history of streamvalues (look-behind); b) lack of direct support of filters with multiple indepen-dent input streams; and c) SplitJoin is restricted to duplicate and round-robin,while PCR connect supports any user-defined policy.

FastFlow [2] is a layered stream programming framework with a bottom layerof SISO components connected by anonymous channels used to provide MISOand MIMO components in the middle layer; a top layer of skeletons completesthe framework. It does not provide look-ahead/behind capabilities.

RISC-pb2l [1] is a set of parallel building blocks implemented over FastFlowallowing the construction of complex parallel computation patterns. RISC doesnot model directly complex data connections because channels are anonymous,and inherits FastFlow’s lack of support for look-ahead/look-behind in its inputchannels.

S-Net [29] is a streaming computation model consisting of stateless SISOboxes interconnected by streams of records forming an acyclic computationgraph. Record subtyping is provided to enable composition and adaptationof boxes to different environments. The S-Net model does not provide look-ahead/behind capabilities on its input streams, and can not easily model com-plex data connections because channels are anonymous.

In summary, none of the cited frameworks fully supports the concepts of look-ahead/behind. Besides, they rely on unnamed connections making it difficultto specify complex interactions between components.

107

Other models. Multi-BSP [52] is an extension of BSP [51]. This multi-leveltree model aims at describing the concurrency capabilities of an architecturemade of a combination of hardware and software elements. Each level definesrun-time parameters such as number of processors, synchronization/communi-cation costs, and cache sizes. Its goal is to write concurrent algorithms awareof architectural parameters in order to achieve an optimal implementation forany architecture for any value of the parameters in some specifiable sense. Animportant difference with PCRs is that Multi-BSP imposes several requirementson architectures to support it, while PCRs are completely architecture-oblivious.

Dryad [35] models coarse-grained computations as directed acyclic graphswith sequential pieces of code in the nodes. The description language is flatalthough a graph composition operator is provided. The main focus is on im-plementation performance and scheduling adaptation to available resources.

Clearly, both models are interesting targets for automated code generationfrom PCR specifications.

Other relevant CnC code generation efforts. The Event-Driven Tasksmodel (EDT) [59] is a framework for automatic parallelization of sequential Cloops, using an intermediate C++ implementation-agnostic representation layer.In that work, three different parallel implementation run-times are targeted asback-ends and compared, CnC being one of them. Also, the EDT usage ofinteger vectors for modelling loop nesting resembles this Thesis’s modelling ofmultidimensional FXML indexes in order to support PCR nesting and look-ahead /look-behind operations.

In a recent work [58], a data locality tuning extension of CnC is proposedfor distributed and shared memory scenarios. The current PCR implementationcan be updated in order to generate locality tuning information based on theuser provided PCR runtime information hints.

Chapter 11

Conclusions and FutureWork

11.1. Summary of contributionsFormal models. From a theoretical point of view, we defined a composition-supporting parallel pattern which combines concepts like collectives, eurekas,iteration, recursion, and stream programming. We formalized its abstract se-mantics and proposed a concrete one through a formal translation of PCRs intoCnC. We illustrated with several case studies from different application domainsthat PCRs can ease writing parallel programs.

Parallel programming tools. Tool-wise, we developed a framework whichconsists of (i) a library for writing platform-independent PCRs, and (ii) a code-generation engine based on template meta-programming for translating theminto CnC-based implementations. It is worth mentioning that PCR C++ templatesare designed to enhance portability in the sense that different target runtimescould be used for code generation.

The framework provides theory-driven automated generation of parallel codefrom platform-independent, high-level, structured descriptions. The experimen-tal results provided evidence that the synthesized code can achieve performanceswhich are comparable to those of low-level, unstructured, platform-dependentprograms. This is a sign that the approach could be feasible in practice. Froma software engineering perspective, it would result in less coding effort, morereliability and faster prototyping of several implementations.

11.2. Future challengesThere are several development oportunities for PCRs .

Formal model extensions. We plan to generalize the usage of PCRs nestingin both producer and reducer elements. Regarding connect, we plan to extendFXML semantics in order to allow describing richer connections between PCRs, in

109

110 CHAPTER 11. CONCLUSIONS AND FUTURE WORK

particular, joining in a non-deterministic way the output streams of two or morePCRs into the input stream of a third one.

Another future extension is the ability to specify computations in which the(i, i) relationships inside a PCR can be relaxed to bijective relationships in orderto enable re-indexing of output items based on the order of generation insteadof the indexing order imposed by the producer. This allows the reducer to startprocessing items as soon as they are produced which is particularly useful forimplementing faster eureka computations.

C++ high-level framework extensions. An envisaged extension of the frame-work consists in enabling asynchronous composition of PCRs, letting caller andcallee proceed in parallel by not blocking the former until it needs the latter’soutput or it returns (e.g., Futures [53], Cilk [10]). Other future extension to C++

PCRs is to give a general way to write PCR networks by means of a generic connect

construct with composition capabilities.

CnC implementation extensions. The current CnC implementation does nottake advantage of using CnC tag tuners which can help implement more effi-cient partitioning of tag ranges for coarse grained computations. We have tomake an extensive evaluation of the distributed capabilities of the CnC-based PCR

implementation.

Alternative implementations. Other future research directions include ex-perimenting PCRs on other platforms, such as distributed clusters, and GPUs, aswell as implementing the PCR concept in other programming languages.

User studies. Extending the experience reported in Section 9.5, we plan torun experiments measuring ease of use of PCRs compared to other parallel pro-gramming frameworks, both with students and with experienced programmers.One goal is to evaluate the feasibility of PCRs as tools for concurrent programmingteaching.

Appendices

111

Appendix A

FXML ParallelSpecifications

FXML [6] is a language for expressing concurrency, together with control anddata dependencies which can be annotated with properties to restrict parallelismbecause of timing or precedence constraints.

In this section, we review the FXML basic syntax to be used in for thesemantics in the following chapters of this thesis.

A.1. Abstract syntaxThis section overviews the abstract syntax of FXML used for the remainder

of this thesis. It is out of the scope of this chapter to explain the full concretesyntax of FXML pnodes used by Jahuel, which is defined as an XML schema.

The body of an FXML specification is composed of blocks called pnodes.1

Basic pnodes:

nil denotes an empty set of executions.

Let X be the set of variables. Variables store values from a set V . Anassignment α has the form x0 = ζ(x1, . . . , xn), where xi ∈ X, i ∈ [0, n],and ζ : V n → V is a computable function. We write αi for xi.Variables are assumed to be assigned in static single assignment (SSA)form, that is, there is only one assignment statement for each variable.A variable v appearing in expressions denotes the value of a certain pre-vious assigment made to v with some index I. The assignment index canbe made specific (i.e. vI appearing in an expression) in order to refer tosome specific value of the history of assignments to v.

legacy B declares a block B of legacy code written in, e.g., C, C++, Java,etc.

1The term pnode stands for “presentation node”. This notion comes from model theory: apnode “presents” an abstract execution.

113

114 APPENDIX A. FXML PARALLEL SPECIFICATIONS

Conditional pnodes: if ζ(x1, . . . , xn) then p else q, where p and q are pnodes,and ζ : V n → B is a boolean function.

Sequential composition: seq p1 . . . pn.

For/while loops: for(i = init(x1, . . . , xn);test(i);i = inc(i))〈per=P〉 p, andwhile(test(x1, . . . , xn))〈per=P〉 p, express iterations: i is the iteration variable,init : V n → is a computable function that gives the initial value of i, inc :→is the increment function, and test :→ B is a boolean function that defines thelooping condition. Variable i is assumed not to be modified in p.

The optional declaration per=P states that the loop is periodic with periodP , that is, there is a loop iteration every P . The execution time of the body ofthe loop has to be attached to p. Different loop iterations may have differentexecution times, as long as they are consistent with the loop period P and theexecution time interval attached to p (see below).

Parallel composition: par p1 . . . pn.

Forall loops: forall(i = init(x1, . . . , xn);test(i);i = inc(i)) p where i, init,inc and test are as for for-loops specifies parallel executions of p.

Data-driven loops: forall v p where v is a FXML variable specifies parallelexecutions of p spawned by assignments made to v. In execution of the pof index i, apparitions of v refer to value vi. Analoguously, for v p specifiessequential executions of p for each assignment on v.

Labeling: L: p is a pnode.

Dependencies: dep{〈[a,b]〉〈type〉L1 → L2}p, with a, b ∈, specifies a depen-dency between occurrences of two descendants Li : pi, i ∈ [1, 2], of p. Theoptional declaration type annotates the dependency with a type:

The default type is weak and means that at least one occurrence of p1should precede every occurrence of p2.

The type strong means that every occurrence of p1 should precede at leastan occurrence of p2.

The type (k, f(k)) means that the f(k)-th occurrence of p2 should bepreceded by the k-th occurrence of p1.

The optional declaration [a,b] specifies that the timing distance between thecorresponding occurrences of p1 and p2 falls in the interval [a, b].

Execution times: p[a,b], a, b ∈, means that the execution time of p is in theinterval [a, b].

Example 6 (Writer/Reader). The FXML specification of a simple programwhere a reader reads and prints out a value written by a writer is as follows:

A.1. ABSTRACT SYNTAX 115

dep [0,15] W -> R

par

Writer:

seq

p = 0 [0,1]

while(true) per=10

W: seq {

x = p

p = p + 1

} [0,1]

Reader:

while(true)

R: seq {

y = x

legacy{ printf("%d\n", y); }

} [0,1]

The declaration dep W → R declares the dependency between occurrences of pn-odes labelled W, in Writer, and R, in Reader. This dependency comes from the factthat variable x must have some value by the time Reader uses it. Since no typeis attached to the dependency, it follows that it is of the default type weak. Thismeans that written values may not be read or read more than once, but x musthave been written at least once by Writer before it is first read by Reader.

This default behavior can be strengthen with a strong type declaration torequire that every written value must be read at least once. To specify thatthe value written in the i-th iteration of the Writer’s loop must be used in thei-th iteration of the Reader’s loop, the declaration (i,i) has to be added to thedependency dep W → R.

The period declaration attached to the while loop of the Writer states thatthe body of the loop is executed periodically every 10 time units. The interval[0,15] in dep [0,15] W→R serves for specifying a freshness constraint: the valueof x cannot be read if the time distance between the write and read operations isgreater than 15 time units. The execution times of p = 0, W and R are specifiedto be in the interval [0, 1]2.

Example 7 (Smith-Waterman). The Smith-Waterman [20] local sequence match-ing algorithm consists of computing the elements of a N + 1 by M + 1 matrixA, from two strings S1 and S2 of lengths N + 1 and M + 1, respectively.

In FXML, it can be expressed as follows:

dep ((i,j), (i+1,j)) LA -> LX

dep ((i,j), (i,j+1)) LA -> LY

dep ((i,j), (i+1,j+1)) LA -> LZ

seq

forall(j = 0; j <= M; j+1)

forall(i = 0; i <= N; i+1)

LI: A[i][j] = 0

forall(j = 1; j <= M; j+1)

2When the execution time of a pnode is not given, it means it can take an arbitrary amountof time to execute which is consistent with all timing constraints.


forall(i = 1; i <= N; i+1)

seq

par {

LX: X = A[i-1][j] + 2

LY: Y = A[i][j-1] + 2

LZ: Z = A[i-1][j-1] + (S1[i]==S2[j]?-1:1)

}

LA: A[i][j] = MIN(0, X, Y, Z)

The dependencies state that the computation of each element (i, j) is a functionof its “North” (i−1, j), “West” (i, j−1), and “NorthWest” (i−1, j−1) neighborsA.

Hereinafter, the keyword seq will be omitted in the examples.

A.2. Semantics

A.2.1. DefinitionsBefore giving semantics to FXML specifications, let us introduce some defi-

nitions.

Indexed assignments: An index is a list I of natural numbers and labels.〈`1, . . .〉 denotes the list consisting of elements `1, . . ., and ◦ denotes concatena-tion of lists. An indexed assignment is denoted αI . A set of indexed assignmentsis denoted A.

Timing: Time is modeled with a timing function τ : A → R+ × R+. Wewrite, τ b(αI) = π1(τ(αI)), and τe(αI) = π2(τ(αI)), which denote respectivelythe beginning and ending times of assignment αI .

τ satisfies A, denoted τ |= A, iff for each αI ∈ A, τ b(αI) ≤ τe(αI).

Dependencies: Let Out = {x ∈ X | ∃αI ∈ A : x = α0} be the set ofvariables assigned in A.

The relation d−→⊆ A × A models data dependencies: for all βJ ∈ A, forall βj , if βj ∈ Out, then there exists a unique αI ∈ A s.t. α0 = βj andαI

d−→ βJ . We write αI βj−→ βJ as a shorthand for αI d−→ βJ ∧ α0 = βj .

τ |= d−→ iff ∀αI , βJ ∈ A : αId−→ βJ =⇒ τe(αI) ≤ τ b(βJ).

The relation ;−→⊆ A×A gives an order between indexed assignments inA, thus modeling dependencies derived from the sequential composition.

τ |= ;−→ iff ∀αI , βJ ∈ A : αI;−→ βJ =⇒ τe(αI) ≤ τ b(βJ).

We define −→= d−→ ∪ ;−→. τ |=−→ iff τ |= d−→ and τ |= ;−→.

A.2. SEMANTICS 117

Valuations: A can be seen as a family {An}n∈ of sets of indexed assignments,where An contains only indexed assignments αI of the form x0 = ζ(x1, . . . , xn).Let Υ = {νn}n∈ be a family of -indexed functions, with νn : An → V n+1,νn(αI) = (v0, v1, v2, . . . , vn), v0 = ζ(v1, v2, . . . , vn). We write νn(αI)i for vi.

Υ |= d−→ iff ∀(αI , βJ) ∈ An ×Am: αI βj−→ =⇒ νm(βJ)j = νn(αI)0.

Executions: An execution e is a tuple (X,A, V, d−→, ;−→, τ,Υ), such that τ |=A, τ |=−→, Υ |= d−→.

Timing constraints: The starting and ending time of e are, respectively:τ b(e) = minαI∈Ae

τ b(αI), and τe(e) = maxαI∈Aeτe(αI).

Subexecutions: f is a subexecution of e, denoted f ⊆ e, iff Af ⊆ Ae,τf = τe �Af

, νnf = νne �Anf, d−→f= d−→e�Af×Af

, ;−→f= ;−→e�Af×Af, where �

is “restricted to”.

Partitions: A partition of e, denoted &i∈Iei, is such that for all i ∈ I, ei is anon-trivial subexecution of e, and for all i 6= j, Aei∩Aej = ∅, and

⋃i∈I Aei

= A.A sequential partition of e, denoted ;i∈Iei, is a partition such that ∀αI ∈

Aei, βJ ∈ Aej

: i < j =⇒ αI;−→e β

J .

Dependencies: For e1, e2 ⊆ e, e1 −→e e2 iff ∀αI ∈ Ae1 , βJ ∈ Ae2 : αI −→e

βJ .

Indexed executions: The indexing of e with K is the execution eK whereAeK is defined such that for all αI ∈ Ae : αK◦I ∈ AeK . We write e ∼=

KeK to

denote that e is the same execution as eK modulo the indexing with K.

A.2.2. Semantic rulesThe semantics of an FXML specification is a set of executions. We use an

algebraic definition of the semantics [25]. If p is a pnode and e is an execution,e |= pmeans that e is an execution for p. The semantics of p is [[p]] = {e | e |= p}.

Nil: The semantics of nil is the empty execution (∅, ∅, ∅, ∅, ∅, ∅, ∅).

Assignments: e |= α iff Ae = {αI} for some index I.

Conditional statements: e |= if ζ(x1, . . . , xn) then p else q iff e = e1; e2 s.t.e1 |= ζ(x1, . . . , xn), with Ae1 = {αI}, and e2 |= p if νne1

(αI) = true, else e2 |= q.

Sequential composition: e |= seq p1 . . . pn iff e = ;i∈[1,n]ei, such that ei |=pi.


Iterations: Let K = {kj}j∈J (with J a finite or infinite interval of ) be theindexed set of the values taken by the iteration variable i. K is defined by inc,which is increasing, that is: i < j =⇒ ki ≤ kj , for all ki, kj ∈ K.

The semantics of for-loops is the set of executions defined as follows:e |= for() p iff e = ;j∈J fj〈j〉, where fj |= p, j ∈ J , and for every αI ∈ Amfj

(for any m ∈): αIl = i =⇒ νmfj(αI)l = kj (that is, the value of the

iteration variable i is equal to kj in fj).If the optional declaration [per=P] is present, e is such that: for all j ∈ J ,[ τ b(fj〈j〉), τe(fj〈j〉) ] ⊆ [(j − 1)P, jP ].

For while, assignments are indexed using a hidden variable j, whose valuesare 0, . . . , N − 1, when the loop stops after N turns:e |= while(test(x1, . . . , xn)) p iff e = ;j∈[0,N−1]

(cj ; fj〈j〉

); cN , where

cj |= test(x1, . . . , xn), j ∈ [0, N ], fj |= p, j ∈ [0, N−1], and the conditionsevaluate to true in cj , j ∈ [0, N − 1], and to false in cN . The semantics ofa non-terminating loop is an infinite execution where conditions evaluateto true for all cj .If the period declaration is given, the semantics is similar to for-loopperiods.

Parallel composition: e |= par p1 . . . pn iff e = &i∈[1,n]ei, such that ei |= pi.

Proposition 1. Parallel composition is commutative and associative.

Proposition 2. [[seq p1 . . . pn]] ⊆ [[par p1 . . . pn]].

Forall loops: Let K = {kj}j∈J be the indexed set of indices defined by inc.e |= forall() p iff e = &j∈J fj

〈j〉, where fj |= p, j ∈ J , and for every αI ∈ Amfj

(for any m ∈): αIl = i =⇒ νmfj(αI)l = kj .

Data-driven Forall loops: Let K = {kj} be the indexed set of indices ofassignments to v. e |= forall v p iff e = &j∈J fj

〈j〉, where fj |= p, j ∈ J , andfor every αI ∈ Amfj

(for any m ∈): αIl = i =⇒ νmfj(αI)l = kj .

Proposition 3. [[for(i = init(x1, . . . , xn);test(i);i = inc(i))〈per=P〉 p]] ⊆[[forall(i = init(x1, . . . , xn);test(i);i = inc(i))〈per=P〉 p]].

Proposition 4. [[par p1 . . . pn]] ∼=〈i〉

[[forall(i=1;i<=n;i+1)p]], where ∼=〈i〉

meansequal modulo the indexing given by the iteration variable i.

Dependencies: e |= dep{〈[a,b]〉〈type〉L1 → L2}p, iff e |= p, and

type = weak: for every e2 ⊆ e, s.t. e2 |= L2 : p2, there exists e1 ⊆ e, s.t.e1 |= L1 : p1 and e1 −→e e2.

type = strong: e satisfies the condition above and for every e1 ⊆ e, s.t.e1 |= L1 : p1, there exists e2 ⊆ e, s.t. e2 |= L2 : p2 and e1 −→e e2.

A.2. SEMANTICS 119

W0

W1

W2

R0

R1

0

10

5

15

20

R2

25

30

W0

W1

W2R0

R1

W0

W1

W2

R0

R1

R2

W0

W1

W2

R0

R1

R3

W0

W1

W2

R0

R1

R2R3

R2

W0

W1

W2

R0

R1

R2

(a) (b) (c)weak strong (i,i)

;

;

;

;

;

; ;

;

; ;

;

; ;

;

;

; ;

;

;

;

;

;

;

; ;

;

;

;

;

; ;

;

;

;

;

;

;

;

Figure A.1: Examples of executions of Writer-Reader.

type = (j, f(j)): for all j ∈ J , if e〈j〉1 ⊆ e is s.t. e〈j〉1 |= L1 : p1, and

e〈f(j)〉2 ⊆ e is s.t. e〈f(j)〉

2 |= L2 : p2, then e〈j〉1 −→e e〈f(j)〉2 .

If the optional timing interval [a,b] is present, e is such that: for all ei ⊆ e,ei |= pi, i = 1, 2, e1 −→e e2 =⇒ τ b(e2)− τe(e1) ∈ [a, b].

Proposition 5. Let q be dep{〈type〉L1 → L2}p. We have that:[[q[type := (i, i)]]] ⊆ [[q[type := strong]]] ⊆ [[q[type := weak]]].

Proposition 6. ∀[a′, b′] ⊆ [a, b]:[[dep{[a′, b′]L1 → L2}p]] ⊆ [[dep{[a, b]L1 → L2}p]] ⊆ [[dep{L1 → L2}p]].

Execution times: e |= p[a,b] iff e |= p and τe(e)− τ b(e) ∈ [a, b].

Proposition 7. ∀[a′, b′] ⊆ [a, b]: [[p[a′, b′]]] ⊆ [[p[a, b]]] ⊆ [[p]].

Example 8 (Writer/Reader). Figure A.1 shows examples of executions of Ex-ample 6 for different types of dependencies between pnodes W and R: (a) weak,(b) strong, and (c) (i, i). Non-labelled assignments are not shown. The verticalplacement of W’s and R’s corresponds to their occurrence in global time, whichproceeds from top to bottom. Recall that, by Proposition 5, any execution of type(i, i) is also strong, and any strong is also weak.

The executions of pnodes Writer and Reader are total orders of the formW0

;−→ W1 · · · and R0;−→ R1 · · · , respectively, which are consistent with the

timing constraints (Writer’s loop period and execution times). Each executionof the composed system contains the union of the executions of pnodes Writer

and Reader which are consistent with the dependency declaration dep [0,15] W →R, together with precedences added by it. For instance, in the execution shownin (a)-left, the value written by W0 is read by R0 and R1. This means that R0 andR1 started at most 15 time units after W0 terminated.

However, the occurrence of W1 between R0 and R1 does not prevent the valuewritten by W0 to be read twice. This execution models a behavior that may occurin a concrete implementation of this program where values are buffered.

Example 9 (Smith-Waterman). Figure A.2 shows a part of the model of theSmith-Waterman program (Example 7).


LA(1,1)

LI(0,1) LI(1,0)LI(0,0)

LX(1,2) LY(1,2)LZ(1,2)

LA(1,2)

LI(0,2)

LX(2,1) LY(2,1)LZ(2,1)

LA(2,1)

LI(2,0)

LX(1,1) LY(1,1)LZ(1,1)

; ;

; ; ;

; ;;;;

;;; ; ; ;

Figure A.2: Examples of executions of Smith-Waterman.

Appendix B

Extended Proof Sketches

B.1. Functional correctness of the CnC transla-tion of PCRs.

In order to prove Property 2, we will make use of the following lemmas:

Proposition 8. The result of evaluating a basic function f(x1..xn) in FXML

semantics corresponds to the evaluation of f(x1..xn) in CnC semantics, providedthe value of each assignment xIi is stored in dataIxi

for some CnC memory locationdataxi .

Proof. By structural induction in the expression E and applying the definition ofEI , replacing every apparition of xi[d] by dataxi

.get(I + d) in E and evaluatingthe resulting expression in CnC semantics.

Proposition 9. For any variable v in the definition of a PCR Q, and index Iof an assignment to v as specified by the FXML semantics of the evaluation ofQ(x1..xn), there is an equivalent CnC memory location dataIv storing, after theevaluation of Q̂(x1..xn) in CnC semantics, the same value as assignment vI .

Proof. By case analysis of each PCR component type (producer, consumer orreducer). The reader is directed to section 7.2 for the definition of each CnC

translation. This proof assumes that all the basic functions f used in the pro-ducer, consumer and reducer components are total functions. Also, the proofuses Proposition 8 to assume that f and f calculate the same value.

Producer. The FXML producer reads input variables x1..xn and its own previ-ous output p and assigns p = F (x1..xn, p, i) for all i = 0..boundf (x1..xn). TheCnC producef step reads the input from datax1..xn and evaluates f(x1..xn, p, i)

I◦i

for each i, storing its result in dataI◦ip (line 3 of producef ). Thus, each assignmentto p has an unique CnC data location.

For each consumer type, we need to show all the inputs of a consumer Ciare stored in well defined CnC memory locations.

121

122 APPENDIX B. EXTENDED PROOF SKETCHES

Basic consumer. We apply global induction in i. Case i=0. ConsumerC0 only reads x1..xn, p which have been already shown to be stored in CnC

memory locations, and stores its output in CnC memory location datac0 . Casei>0. Consumer Ci reads x1..xn, p and c0..ci−1. By inductive hypothesis, all theCj outputs with j < i are stored in CnC memory locations datacj , then Ci has allits input variables stored in CnC memory locations, and can be evaluated.

In all cases, the FXML consumer Ci reads input variables x1..xn and con-sumer output variables c0..ci−1 and assigns cIi = f(x1..xn, p, c0..ci−1). The CnC

basic_consumef step reads the input from datax1..xn and datac0..ci−1 and evaluatesf(x1..xn, c0..ci−1)

Istoring its result in dataIci

(line 2 of basic_consumef ). Thus,the assignment cIi is stored in CnC memory location dataIci

.

Nested PCR consumer. Having proven the basic consumer case, we can applya recursive argument to treat the nested PCR evaluation as a basic function eval-uation and consider this case to be functionally equivalent to a basic consumer.

Iterating consumer. The FXML iterate construct is divided in two stages: thefirst one calculates the fixed point of f(y) with a straightforward CnC translation;assignments to local FXML variable y are stored in datay. The second part consistsof the last assignment (line 8 of step iteratef ) showing that assignment cIi isstored in dataIci

. If the function f is a nested PCR the same recursive argumentas in the nested PCR consumer case applies.

Reducer. FXML reduce is similar to iterate in the sense that it first calculatesan iterative computation and finally assigns the result to a CnC mamory location.The reduce iteration sequentially applies operation f on the temporary value vand the I ◦ i index of FXML variables x1..xnc1..ck. The reducef step mimics thisiterative behavior, finally storing the last v value in dataIR (line 5 of step reducef ).

Thus, for each type of PCR component we have shown that the assignmentsit makes are stored in a well defined CnC memory location.

Property 2. Let Q be a PCR definition of arity n and P̂ its CnC translation. Letv1..vn be FXML variables with I assignments each and value v = yI = Q(vI1 ..vIn)be the result of evaluating P with parameters vI1 ..vIn as specified by FXML. Letdatav1..vn

a CnC memory region storing the full history of assignments to v1..vn.Then, the CnC execution of P̂ (I) will store in dataIy the same value v as specifiedby the FXML semantics of P .

Proof. This result follows from the application of the previous propositions:Proposition 9 warrants that the reduce assignment RI (which by definition isthe result of the evaluation of the PCR in FXML semantics) has a well definedmemory location dataIR in CnC semantics and Proposition 8 warrants that thecalculated value is equivalent in both formal models.

Bibliography

[1] M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick, and M. Torquati.Design patterns percolating to parallel programming framework implemen-tation. International Journal of Parallel Programming, 42(6):1012–1031,2014. 1.9.1, 2.2, 10

[2] M. Aldinucci, M. Danelutto, P. Kilpatrick, and M. Torquati. Fastflow: high-level and efficient streaming on multi-core. Programming multi-core andmany-core computing systems, parallel and distributed computing, 2014.1.9.1, 10

[3] C. K. Anand and W. Kahl. Synthesizing and verifying multicore parallelismin categories of nested code graphs. In Process Algebra for Parallel andDistributed Processing, volume 2, pages 3–45. Chapman & Hall, 2009. 2.2

[4] C. Apté and S. Weiss. Data mining with decision trees and decision rules.Future generation computer systems, 13(2):197–210, 1997. 1.5, 6.5

[5] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubia-towicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, andK. Yelick. A view of the parallel computing landscape. Communicationsof the ACM, 52(10):56–67, Oct. 2009. 1.1.1, 2.1

[6] I. Assayad, V. Bertin, F.-X. Defaut, P. Gerner, O. Quévreux, and S. Yovine.Jahuel: A formal framework for software synthesis. In Formal Methods andSoftware Engineering, pages 204–218. Springer, 2005. 1.1.3, 1.2.2, 2.2, 2.3,3.2, A

[7] B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, and M. Vanneschi. P3l:A structured high-level parallel language, and its structured support. Con-currency: Practice and Experience, 7(3):225–255, 1995. 1.2.1, 3.1

[8] H. C. Baker, Jr. and C. Hewitt. The incremental garbage collection ofprocesses. In Proceedings of the 1977 Symposium on Artificial Intelligenceand Programming Languages, pages 55–59, New York, NY, USA, 1977.ACM. 5.1

[9] E. Belikov, P. Deligiannis, P. Totoo, M. Aljabri, and H.-W. Loidl. A sur-vey of high-level parallel programming models. Heriot-Watt University,Edinburgh, UK, 2013. 1.1.2, 2.2

[10] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall,and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of

123

124 BIBLIOGRAPHY

parallel and distributed computing, 37(1):55–69, 1996. 1.1.2, 1.9.2, 2.2, 5.1,11.2

[11] R. Buchty, V. Heuveline, W. Karl, and J.-P. Weiss. A survey on hardware-aware and heterogeneous computing on multicore processors and accelera-tors. Concurrency and Computation: Practice and Experience, 24(7):663–675, May 2012. 1.1.1, 2.1

[12] Z. Budimlić, M. Burke, V. Cavé, K. Knobe, G. Lowney, R. Newton, J. Pals-berg, D. Peixotto, V. Sarkar, F. Schlimbach, et al. Concurrent collections.Scientific Programming, 18(3):203–217, 2010. 1.1.2, 1.1.3, 1.6, 1.6.1, 2.2,2.3, 7, 7.1

[13] B. Chamberlain. Chapel, chapter 6. MIT Press, 2015. 1.1.2, 2.2

[14] P. Ciechanowicz, M. Poldner, and H. Kuchen. The Münster Skeleton Li-brary Muesli - A Comprehensive Overview. Technical report, University ofMünster, 2009. 1.9.1, 10

[15] E. M. Clarke, O. Grumberg, and D. Peled. Model checking. MIT press,1999. 1.5, 6.2

[16] M. I. Cole. Algorithmic skeletons: structured management of parallel com-putation. Pitman London, 1989. 1.1.2, 1.2.1, 2.2, 3.1

[17] L. Dagum and R. Menon. Openmp: an industry standard api for shared-memory programming. IEEE computational science and engineering,5(1):46–55, 1998. 1.1.2, 2.2

[18] J. Darlington, Y.-k. Guo, H. W. To, and J. Yang. Parallel skeletons forstructured composition. In ACM SIGPLAN Notices, volume 30, pages 19–28. ACM, 1995. 1.2.1, 3.1

[19] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on largeclusters. Communications of the ACM, 51(1):107–113, 2008. 1.5, 1.9.1, 2.2,6.3, 10

[20] K. Ebcioglu, V. Sarkar, T. El-Ghazawi, J. Urbanic, and P. S. Center. Anexperiment in measuring the productivity of three parallel programminglanguages. In Proceedings of the Third Workshop on Productivity and Per-formance in High-End Computing, pages 30–36, 2006. 7

[21] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, andG. Fox. Twister: A runtime for iterative mapreduce. In Proceedings ofthe 19th ACM International Symposium on High Performance DistributedComputing, HPDC ’10, pages 810–818, New York, NY, USA, 2010. ACM.1.9.1, 10

[22] J. Enmyren and C. W. Kessler. Skepu: a multi-backend skeleton program-ming library for multi-gpu systems. In Proceedings of the fourth interna-tional workshop on High-level parallel programming and applications, pages5–14. ACM, 2010. 1.9.1, 2.2, 10

BIBLIOGRAPHY 125

[23] J. Falcou and J. Sérot. Formal semantics applied to the implementationof a skeleton-based parallel programming library. Parallel Computing: Ar-chitectures, Algorithms and Applications (Proc. of PARCO 2007, Julich,Germany), 38:243–252, 2008. 1.9.1, 2.2, 10

[24] J. Falcou, J. Sérot, T. Chateau, and J.-T. Lapresté. Quaff: efficient c++ de-sign for parallel skeletons. Parallel Computing, 32(7):604–615, 2006. 1.9.1,2.2, 10

[25] J. Goguen and G. Malcolm. Algebraic semantics of imperative programs.MIT press, 1996. A.2.2

[26] H. González-Vélez and M. Cole. Adaptive structured parallelism for dis-tributed heterogeneous architectures: A methodological approach withpipelines and farms. Concurrency and Computation: Practice and Ex-perience, 22(15):2073–2094, 2010. 1.1.2, 2.2

[27] H. González-Vélez and M. Leyton. A survey of algorithmic skeleton frame-works: high-level structured parallel programming enablers. Software:Practice and Experience, 40(12):1135–1160, 2010. 1.1.2, 1.2.1, 1.9.1, 2.2,3.1, 3.1, 10

[28] Google. Google I/O 2014 Keynote Talk.https://www.google.com/events/io#wtLJPvx7-ys, 2014. Accessed:February 11, 2016. 1.5, 6.6

[29] C. Grelck, S.-B. Scholz, and A. Shafarenko. Asynchronous stream process-ing with s-net. International Journal of Parallel Programming, 38(1):38–67,2010. 1.9.1, 10

[30] R. Hempel. The mpi standard for message passing. In International Con-ference on High-Performance Computing and Networking, pages 247–252.Springer, 1994. 1.1.2, 1.1.3, 2.2, 2.3, 3.1

[31] Herb Sutter. The Free Lunch Is Over: A Fundamental Turn TowardConcurrency in Software. http://www.gotw.ca/publications/concurrency-ddj.htm, Aug. 2009. Accessed: 2018-01-18. 1.1, 2.1, 11.2

[32] C. A. R. Hoare. Communicating sequential processes. In The origin ofconcurrent programming, pages 413–443. Springer, 1978. 1.9.1, 10

[33] T. Hoefler and R. Belli. Scientific benchmarking of parallel computing sys-tems: Twelve ways to tell the masses when reporting performance results.In Proceedings of the International Conference for High Performance Com-puting, Networking, Storage and Analysis, SC ’15, pages 73:1–73:12, NewYork, NY, USA, 2015. ACM. 1.8.2, 9.2

[34] S. Imam and V. Sarkar. The eureka programming model for speculativetask parallelism. In LIPIcs-Leibniz International Proceedings in Informat-ics, volume 37. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2015.1.1.3, 2.3, 3.1, 4.2.2

[35] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributeddata-parallel programs from sequential building blocks. In ACM SIGOPSOperating Systems Review, volume 41, pages 59–72. ACM, 2007. 1.9.1, 10

126 BIBLIOGRAPHY

[36] N. Javed and F. Loulergue. OSL: Optimized Bulk Synchronous ParallelSkeletons on Distributed Arrays, pages 436–451. Springer Berlin Heidelberg,Berlin, Heidelberg, 2009. 1.9.1, 2.2, 10

[37] H. T. Kung. Computational models for parallel computers. PhilosophicalTransactions of the Royal Society of London. Series A, Mathematical andPhysical Sciences, pages 357–371, 1988. 1.2.1, 3.1

[38] D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallellibrary. Acm Sigplan Notices, 44(10):227–242, 2009. 1.1.2, 2.2

[39] B. Liu. Sentiment analysis and subjectivity. Handbook of Natural LanguageProcessing, 2010. 2.4

[40] S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from mistakes: a comprehen-sive study on real world concurrency bug characteristics. In ACM SigplanNotices, volume 43, pages 329–339. ACM, 2008. 1.1.1, 2.1

[41] M. McCool, J. Reinders, and A. Robison. Structured Parallel Programming:Patterns for Efficient Computation. Elsevier, 2012. 1.1.2, 1.2.1, 1.4.3, 1.5,2.2, 2.4, 3.1, 5.3, 6

[42] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundationsand trends in information retrieval, 2(1-2):1–135, 2008. 1.5, 6.6

[43] G. Pérez and S. Yovine. Formal specification and implementation of an au-tomated pattern-based parallel-code generation framework. InternationalJournal on Software Tools for Technology Transfer, pages 1–20, 2017. (doc-ument), 1.1.3, 2.3

[44] J. R. Quinlan. Induction of decision trees. Machine learning, 1(1):81–106,1986. 1.5, 2.4, 6.5

[45] A. Ranta. Implementing programming languages. An introduction to com-pilers and interpreters. College Publications, 2012. 1.1.3, 2.3

[46] J. Reinders. Intel threading building blocks: outfitting C++ for multi-coreprocessor parallelism. " O’Reilly Media, Inc.", 2007. 1.1.2, 2.2

[47] V. A. Saraswat, V. Sarkar, and C. von Praun. X10: concurrent program-ming for modern architectures. In Proceedings of the 12th ACM SIGPLANsymposium on Principles and practice of parallel programming, pages 271–271. ACM, 2007. 1.1.2, 2.2

[48] R. Stephens. A survey of stream processing. Acta Informatica, 34:491–541,1997. 1.1.3, 1.9.1, 2.3, 10

[49] W. Thies and S. P. Amarasinghe. An empirical characterization of streamprograms and its implications for language and compiler design. In J. K.Valentina Salapura, Michael Gschwind, editor, 19th International Confer-ence on Parallel Architecture and Compilation Techniques (PACT 2010),pages 365–376. ACM, 2010. 1.9.1, 10

BIBLIOGRAPHY 127

[50] W. Thies, M. Karczmarek, and S. Amarasinghe. Streamit: A language forstreaming applications. In Compiler Construction, pages 179–196. Springer,2002. 1.4.3, 1.5, 1.9.1, 5.3, 6.1, 10

[51] L. G. Valiant. A bridging model for parallel computation. Communicationsof the ACM, 33(8):103–111, 1990. 1.9.1, 1.9.1, 10, 10

[52] L. G. Valiant. A bridging model for multi-core computing. In Algorithms-ESA 2008, pages 13–28. Springer, 2008. 1.9.1, 10

[53] E. F. Walker, R. Floyd, and P. Neves. Asynchronous remote operationexecution in distributed systems. In Proc. 10th Int. Conf. on DistributedComputing Systems, pages 253–259, 1990. 1.9.2, 11.2

[54] S. Yovine, I. Assayad, F.-X. Defaut, M. Zanconi, and A. Basu. A formalapproach to derivation of concurrent implementations in software prod-uct lines. In M. Alexander and W. Gardner, editors, Algebra for Paralleland Distributed Processing, chapter 11, pages 359–401. Chapman and Hall,CRC Press, 2008. 1.1.3, 1.2.2, 2.2, 2.3, 3.2, 4.2.2

[55] M. Zandifar, M. Abdul Jabbar, A. Majidi, D. Keyes, N. M. Amato,and L. Rauchwerger. Composing algorithmic skeletons to express high-performance scientific applications. In Proceedings of the 29th ACM onInternational Conference on Supercomputing, pages 415–424. ACM, 2015.2.2

[56] M. Zandifar, N. Thomas, N. M. Amato, and L. Rauchwerger. The staplSkeleton Framework, pages 176–190. Springer International Publishing,Cham, 2015. 1.2.1, 3.1

[57] M. Zandifar, N. Thomas, N. M. Amato, and L. Rauchwerger. The STAPLSkeleton Framework, pages 176–190. Springer, Cham, 2015. 1.9.1, 10

[58] S. Chatterjee, N. Vrvilo, Z. Budimlic, K. Knobe, and V. Sarkar. Declarativetuning for locality in parallel programs. In Parallel Processing (ICPP), 201645th International Conference on, pages 452–457. IEEE, 2016. 10

[59] N. Vasilache, M. M. Baskaran, T. Henretty, B. Meister, H. Langston,S. Tavarageri, and R. Lethin. A tale of three runtimes. CoRR,abs/1409.1914, 2014. 10

List of Figures

1.1. Generaciones de CPU de Intel y su evolución en velocidad dereloj y número de transistores (fuente: [31]). . . . . . . . . . . . . 3

1.2. El patrón PCR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.3. Ejemplo de llamada PCRs dentro de las funciones básicas . . . . . 101.4. PCR para dividir y conquistar. . . . . . . . . . . . . . . . . . . . . 111.5. (Izquierda) Conexión de preProcess y doStatistics por medio del

delegate makeBucket. (Derecha) Un posible orden parcial. . . . . . 121.6. PCR “ Alcanzabilidad hacia atrás ” . . . . . . . . . . . . . . . . . . 141.7. Tubería de Sentiment analysis . . . . . . . . . . . . . . . . . . . . 151.8. Diagrama de un grafo CnC. Los diamantes, óvalos y cajas represen-

tan respectivamente colecciones de tags, steps e ítems. Las líneaspunteadas indican relaciones prescribe (de control) entre las co-lecciones de tags y steps; las flechas denotan las relaciones produce/ consume / control entre las colecciones de steps, items y tags. . 16

1.9. Uso de la información de ajuste de PCR en la implementación de CnC 241.10. PCR vs CNC comparaciones de tiempo y memoria para diferentes

números de núcleos informáticos disponibles. . . . . . . . . . . . 261.11. Comparación de tiempo de ejecución entre implementaciones de

N-Queens PCR con profundidad de recursión variable para dife-rentes N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

1.12. Resumen del mejor tiempo de ejecución en cada implementaciónpara N = 16.32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

1.13. Comparación de tiempo entre la implementación de feedbackloop

N-Queens y una implementación de CnC codificada a mano. . . . 281.14. Comparación de las implementaciones de recuento de palabras al

aumentar el número de núcleos disponibles. . . . . . . . . . . . . 291.15. Comparación de dos implementaciones de recuento de palabras

con diferentes tamaños de fragmentos. . . . . . . . . . . . . . . . 30

2.1. Intel CPU generations and their evolution in clock speed andnumber of transistors (source: [31]). . . . . . . . . . . . . . . . . 37

3.1. Fibonacci primes counter in FXML. . . . . . . . . . . . . . . . . . . 453.2. Diagram of Fibonacci primes counter semantics. . . . . . . . . . 46

4.1. The PCR pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . 494.2. PCR for counting Fibonacci primes, depicting a nesting opportunity 514.3. Fibonacci primes counter written in PCR syntax. . . . . . . . . . . 52

129

130 LIST OF FIGURES

4.4. Generic FXML specification of reduce. . . . . . . . . . . . . . . . . . 554.5. Flattened PCR code for countFibPrimes from Example 3 . . . . . . 55

5.1. Example of calling PCRs inside basic functions . . . . . . . . . . . 575.2. PCR definition for divide and conquer. . . . . . . . . . . . . . . . . 585.3. (Left) Connection of preProcess and doStatistics by means of del-

egate makeBucket. (Right) One possible partial order. . . . . . . . 595.4. Complete FXML specification of Figure 5.3 with its connect expansion. 60

6.1. Left: PCR “Low Pass Filter” with its consumer LPF. Right: Depen-dency diagram for NUM_TAPS = 3. . . . . . . . . . . . . . . . 65

6.2. PCR “Backwards Reachability” . . . . . . . . . . . . . . . . . . . . 666.3. Count-words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676.4. N-Queens using PCR-based divide and conquer. . . . . . . . . . . . 676.5. N-Queens with feedbackloop. . . . . . . . . . . . . . . . . . . . . . 686.6. N-Queens with iterate. . . . . . . . . . . . . . . . . . . . . . . . . 686.7. ID3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696.8. PCR data parallel operations for the ID3 algorithm. . . . . . . . . 706.9. Sentiment analysis pipeline (right) and its PCR (left) . . . . . . . . 71

7.1. A diagram of a CnC graph. Diamonds, ovals and boxes respectivelyrepresent tag, step and item collections. Dotted lines indicateprescribes (control) relationships between tag and step collec-tions and arrows denote produce/consume/control relationshipsbetween step, item and tag collections. . . . . . . . . . . . . . . . 76

7.2. CnC implementation of nesting of isPrime as consumer in countFibPrimes. 797.3. Translated CnC code for the countFibPrimes example . . . . . . . . 80

8.1. countFibPrimes written as a C++ template. . . . . . . . . . . . . . . 848.2. Host language code and binding for fib producer. . . . . . . . . . 848.3. Sentiment analysis pipeline code example. . . . . . . . . . . . . . 868.4. C++ CnC abridged implementation of the pcr_var type. . . . . . . . 888.5. PCR tuning information usage in the CnC implementation . . . 898.6. CnC context using consumed_indexes to declare the input dependen-

cies for its computing step . . . . . . . . . . . . . . . . . . . . . . 898.7. Code fragment of a CnC consumer using GROUP_SIZE to implement

chunking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 908.8. Code fragment of a PCR context posting its expected_reads, consumer_indexes

and locations information to each step producing its input . . . . 908.9. Code fragment of a CnC tuner using the performance hints . . . . 91

9.1. PCR vs CNC time and memory comparisons for different num-bers of available computing cores. . . . . . . . . . . . . . . . . . . 94

9.2. Compilation time and space costs and excutable sizes for differentPCR pipeline lengths . . . . . . . . . . . . . . . . . . . . . . . . . 95

9.3. Run time comparison between N-Queens PCR implementationswith varying recursion depth for different N . . . . . . . . . . . . 96

9.4. Summary of the best running time in each implementation forN=16..32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

LIST OF FIGURES 131

9.5. Time comparison between the feedbackloop N-Queens implemen-tation and a hand-coded CnC implementation. . . . . . . . . . . 97

9.6. Comparison of count-words implementations for increasing num-bers of available computing cores. . . . . . . . . . . . . . . . . . . 97

9.7. Comparison of two count-words implementations with varyingchunk sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

9.8. Graphical representations for CnC and PCR parallel CountPrimes 999.9. CnC and PCR user code integration compared. . . . . . . . . . . 1009.10. CnC and PCR pure infrastructure code compared. . . . . . . . . 1019.11. Runtime comparison of the CnC and PCR implementations. . . . 101

A.1. Examples of executions of Writer-Reader. . . . . . . . . . . . . . 119A.2. Examples of executions of Smith-Waterman. . . . . . . . . . . . . 120

List of Tables

1.1. Gramática de un PCR. . . . . . . . . . . . . . . . . . . . . . . . . . 81.2. Bloques básicos de un PCR y su especificación FXML . . . . . . . . . 91.3. Semantics of feedbackloop. . . . . . . . . . . . . . . . . . . . . . . 121.4. Resumen de plantillas C++ usadas para especificar PCRs. . . . . . . 211.5. Parámetro de función pcr_var operadores de envoltura para fun-

ciones básicas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221.6. Ajustes de rendimiento opcionales de PCR . . . . . . . . . . . . . . 23

4.1. PCR grammar. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524.2. PCR building blocks and their FXML specification. . . . . . . . . . . 53

5.1. Semantics of feedbackloop. . . . . . . . . . . . . . . . . . . . . . . 61

8.1. Summary of C++ templates used for PCR specifications. . . . . . . . 838.2. Function parameter pcr_var wrapper operators for basic functions. 858.3. PCR optional performance tuning hints . . . . . . . . . . . . . . . 878.4. Overview of CnC implementations of PCR building blocks . . . . . . 87

133

especificación, diseño e implementación de un entorno de … · 2019-05-24 · programación...

Documents