Cómputo paralelo (Parallel Computing)


  • 8/4/2019 Computo paralelo

    1/173

    History of Parallel Computing

    It begins in 1955 with Gene Amdahl in the United States, working at IBM.

    What is parallel processing?

    It is the division of work into small tasks: assigning several small tasks to multiple workers so that they work simultaneously.

    Parallel processing is the use of multiple processors to execute different parts of the same program simultaneously.

  • 8/4/2019 Computo paralelo

    2/173

    Difficulties:

    coordinating, controlling, and monitoring the workers.

    The main goals of parallel processing are:

    Solve larger problems faster.
    Reduce the execution time of computer programs.
    Increase the size of the computational problems that can be solved.

    Today there are several lines of research within parallel computing, as well as a variety of machines, languages, and applications.

  • 8/4/2019 Computo paralelo

    3/173

    Advantages

    The idea of parallel computing arises from the following needs:
      o perform operations much faster,
      o process large volumes of information, and
      o obtain results in the shortest possible time.

    These needs are the reason for, and at the same time the main goals of, parallel computing.

    Definition. Parallel computer:

    A collection of processors interconnected in some way within a single cabinet in order to exchange information.

  • 8/4/2019 Computo paralelo

    4/173

    Taxonomy (from the Greek τάξις, taxis, "arrangement," and νόμος, nomos, "norm" or "rule") means the science of classification.

    The classification of computers should be based on their most salient characteristics, not on the detailed ones that appear in data sheets. Several taxonomies or classifications of computers exist, such as Skillicorn's, Shore's (6 types), Handler's, and the structural taxonomy of Hockney and Jesshope. The most important of these classifications is Flynn's.

    Flynn's taxonomy classifies computers (their architecture and operating systems) according to the system's ability to process:
    one or more simultaneous data streams,

  • 8/4/2019 Computo paralelo

    5/173

    one or more simultaneous instruction streams.

    SISD: a single instruction stream operates on a single data stream (classic architecture, superscalar processors).

  • 8/4/2019 Computo paralelo

    6/173

    SIMD: a single instruction stream operates on multiple data streams (array processors).

    MISD: multiple instruction streams operate on a single data stream (a class with no implementations, a by-product of the classification).

    MIMD: multiple instruction streams operate on multiple data streams (multiprocessors).

  • 8/4/2019 Computo paralelo

    7/173

    Paradigms of parallel computing. Paradigms of parallel software.

    There are several methods of programming parallel computers. The two most common are:

    message passing and

    data parallelism.

    Message passing: the user makes calls to libraries specifically to share information between processors.

    Data parallel: the partitioning of the data determines the parallelism.
    Shared memory: multiple processors share a common memory space.

  • 8/4/2019 Computo paralelo

    8/173

    Remote memory operation: a set of processors in which a process can access the memory of another process without its participation.
    Threads: a single process has multiple (and concurrent) execution paths.
    Combined models: composed of two or more of the models mentioned above.

    Note: these models are machine/architecture independent; any of the models can be implemented on any hardware with the support of a suitable operating system. An effective implementation is one that comes closest to the hardware model and gives the user ease of programming.

  • 8/4/2019 Computo paralelo

    9/173

    Message Passing
    The message passing model is defined as:

    a set of processes using only local memory;

    processes communicate by sending and receiving messages;

    data transfer requires cooperative operations to be performed by each process (a send operation must have a matching receive).

    Programming with message passing is done by linking with and making calls to libraries which manage the data exchange between processors. Message passing libraries are available for most modern programming languages.
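    As an illustrative sketch only (the slides show no code here), the cooperative send/receive pairing could look like this in C with MPI, assuming the program is launched with at least two ranks:

    /* Minimal message-passing sketch: rank 0 sends one integer to rank 1,
       which posts the matching receive. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* send to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                             /* matching receive */
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }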

    Data Parallel

    The data parallel model is defined as:

    Each process works on a different part of the same data structure.

    Commonly a Single Program Multiple Data (SPMD) approach.

    Data is distributed across processors.
    All message passing is done invisibly to the programmer.
    Commonly built "on top of" one of the common message passing libraries.

    Programming with the data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler.
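    As an illustration only (the slides name no specific data-parallel compiler), a data-parallel-style loop in C with OpenMP, where the partition of the iteration space determines the parallelism:

    /* Illustrative data-parallel-style loop (assumes an OpenMP-capable compiler). */
    #include <omp.h>
    #define N 1000000
    double a[N], b[N], c[N];

    void vector_add(void)
    {
        /* The runtime splits the index range across threads; each thread
           works on a different part of the same arrays. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }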

  • 8/4/2019 Computo paralelo

    10/173

    Classification of Parallelism.

    Temporal (pipeline).
      o A program executes sequentially and, at a certain point, its tasks are divided among several processing units. When execution finishes in each unit, sequential execution resumes.

    Spatial.
      o Spatial parallelism occurs when there are several processors and a process can run on each of them more or less independently. In the optimal case, the execution time is divided by the number of processors that are working.

    Independent.
      o This parallelism does not depend on the topology of the processor network, since the program does not adapt its structure to that of the network's connections; that is, no matter how the processors are connected, the parallel execution of the program is carried out.

  • 8/4/2019 Computo paralelo

    11/173

    Levels of parallelism:

    There are two qualities for parallel programming.
    Granularity: the relative size of the unit of computation that executes in parallel. This unit can be a statement, a function, or an entire process.

    Communication channel: the basic mechanism by which the independent units of the program exchange data and synchronize their activity.

    Statement level.
      o This is the finest level of granularity.
      o It is used in languages such as Power C, Fortran 79/90, Power Fortran 79/90.
      o Shared variables are used within a single memory system.

  • 8/4/2019 Computo paralelo

    12/173

    Thread level.
      o A thread is an independent execution state within the context of a larger program, that is:
        a set of machine registers,
        a call stack,
        the ability to execute code.
      o A program can create several threads to execute in the same address space. The reasons for using threads are portability and performance. The resources of a single processor are shared.

    Process level.
      o A UNIX process consists of:
        an address space,
        a large set of process state values,
        one thread of execution.

  • 8/4/2019 Computo paralelo

    13/173

      o The interprocess communication mechanism can be used to
        exchange data, and
        coordinate the activities of multiple asynchronous processes.
      o A process can create one or more processes; the process that creates another is called the parent process and the one created is called the child process. The initial process is called the root.

  • 8/4/2019 Computo paralelo

    14/173

    Performance.

    The performance of a parallel program is measured in terms of three quantities:

    Speedup, Efficiency, Cost

    Where:

    T = time in seconds
    T1 = time using one processing unit
    Tp = time using two or more processing units
    P = number of processing units
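    The formulas themselves did not survive the transcript; with the definitions above, the usual expressions are:

    S = \frac{T_1}{T_p}, \qquad E = \frac{S}{P} = \frac{T_1}{P\,T_p}, \qquad C = P\,T_p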

  • 8/4/2019 Computo paralelo

    15/173

    Amdahl's Law.
    Amdahl's law concerns the speedup obtained from using parallel processors on a problem, compared with using only a single serial processor.
    To understand what speedup is, first consider speed:

      o The speed of a program is the time it takes to execute. This can be measured in any increment of time.

      o Speedup is defined as the time it takes a program to execute serially (with one processor) divided by the time it takes to execute in parallel (with several processors).
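    In its usual form (not recovered verbatim from the slide), with f the fraction of the program that can be parallelized and P processors:

    S(P) = \frac{1}{(1 - f) + f/P}

    For example, with f = 0.9 and P = 10, S = 1/(0.1 + 0.09) ≈ 5.3.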

  • 8/4/2019 Computo paralelo

    16/173

  • 8/4/2019 Computo paralelo

    17/173

    Parallel Architectures.

    Old criteria:

    Parallel architectures tied to programming models.
    Divergent architectures, with no growth pattern.

  • 8/4/2019 Computo paralelo

    18/173

    Parallel Architectures.

    Current criteria:

    Extension of computer architecture to support communication and cooperation.
      o BEFORE: the instruction set.
      o NOW: communication.

    We must define:
      o abstractions, boundaries, primitives (interfaces);
      o structures that implement those interfaces (hardware or software).

    Compilers, libraries, and the OS are important issues nowadays.

  • 8/4/2019 Computo paralelo

    19/173

    Parallel Architectures.

    Recall that:

    A parallel computer is a set of processing elements that communicate and cooperate to solve large problems quickly.

    We can say that a parallel architecture is:

    conventional architecture
    +
    communication architecture

  • 8/4/2019 Computo paralelo

    20/173

  • 8/4/2019 Computo paralelo

    21/173

    Parallel Architectures.

    Programming models
    A programming model specifies communication and synchronization. Examples:

      o Multiprogramming: no communication or synchronization; parallelism at the program level.
      o Shared memory: like a bulletin board.
      o Message passing: like letters or telephone calls, point to point.
      o Data parallelism: several agents act on individual data items and then exchange control information before continuing. The exchange is implemented with shared memory or with message passing.

  • 8/4/2019 Computo paralelo

    22/173

    Parallel Architectures.

    Levels of abstraction in communication.

  • 8/4/2019 Computo paralelo

    23/173

    Parallel Architectures.

    Evolution of the architectural models:

    The programming model, the communication, and the organization of the machine make up the architecture.
    Shared memory address space.
    Message passing.
    Data parallelism.
    Others:
      Dataflow.
      Systolic arrays.

  • 8/4/2019 Computo paralelo

    24/173

    Shared Memory.

    Any processor can directly reference any memory location.
      o Communication happens implicitly through loads and stores.

    Advantages:
      o Transparent location.
      o Programming similar to time-sharing on uniprocessors,
        except that the processes run on different processors.
      Good performance for load distribution.

  • 8/4/2019 Computo paralelo

    25/173

  • 8/4/2019 Computo paralelo

    26/173

    Shared Memory.

  • 8/4/2019 Computo paralelo

    27/173

    Shared Memory.

    Communication hardware:

    Independent growth of memory, I/O, or processing capacity by adding modules, controllers, or processors.

  • 8/4/2019 Computo paralelo

    28/173

    Shared Memory.

    Mainframe communication strategy:

    Crossbar network.
    Initially limited by the cost of the processors; later, by the cost of the network.
    Bandwidth grows with p.
    High cost of scaling up; use of multistage networks.

  • 8/4/2019 Computo paralelo

    29/173

    Shared Memory.

    Minicomputer communication strategy:

    Almost all microprocessor-based systems use a bus.
    Widely used for parallel computing.
    Called SMPs, symmetric multiprocessors.
    The bus can become a bottleneck.
    Cache coherence problem.
    Low cost of scaling up.

  • 8/4/2019 Computo paralelo

    30/173

    Shared Memory.

    Example: Intel Pentium Pro Quad. Coherence and multiprocessing integrated into the processor module.

  • 8/4/2019 Computo paralelo

    31/173

    Shared Memory.

    Example: Sun Enterprise. 16 cards of either type: processors + memory, or I/O. Memory access is over the bus, symmetric.

  • 8/4/2019 Computo paralelo

    32/173

    Shared Memory.

    Other communication options:

    Interconnection problems: cost (crossbar) or bandwidth (bus).
    Dance-hall: scalable at lower cost than a crossbar.
      o Uniform memory-access latency, but high.
    NUMA (non-uniform memory access):
      o Construction of a single memory space with different latencies.
    COMA (Cache Only Memory Architecture):
      o Memory architecture based on shared caches.

  • 8/4/2019 Computo paralelo

    33/173

    Shared Memory.

    Example: Cray T3E. Scales to 1024 processors, 480 MB/s links.

    The memory controller generates the requests for non-local locations.
    It has no hardware coherence mechanism (the SGI Origin and others do provide one).

  • 8/4/2019 Computo paralelo

    34/173

    Message Passing.

    Built out of complete computers, including I/O.
      o Communication through explicit I/O operations.

    Programming model: direct access only to private addresses (local memory); communication through messages (send/receive).

    Block diagram similar to NUMA,
      o but communication is integrated at the I/O level.
      o Like networks of workstations (clusters), but with greater integration.
      o Easier to build and scale than NUMA systems.

    Programming model less integrated into the hardware.
      o Libraries or operating-system intervention.

  • 8/4/2019 Computo paralelo

    35/173

    Message Passing.

    send specifies the buffer to transmit and the receiving process.
    recv specifies the sending process and the storage buffer.
    These are memory-to-memory copies, but process names are required.
    Optionally, an identifier can be included in the send, with matching rules applied at the destination.
    In the simple form, pairing is achieved through the synchronization of send/recv events.
      o Multiple synchronization variants exist.
    Large overheads: copying, buffer management, protection.

  • 8/4/2019 Computo paralelo

    36/173

    Paso de Mensajes.

  • 8/4/2019 Computo paralelo

    37/173

    Message Passing.
    Evolution of message-passing machines

    Early machines: a FIFO on each link.
      o Programming model very close to the hardware; simple synchronization operations.
      o Replaced by DMA, allowing non-blocking operations; buffering at the destination until the recv.

    Decreasing influence of the topology (hardware routing).
      o Store-and-forward routing: topology matters.
      o Introduction of multistage networks.
      o Higher cost: node-to-network communication.
      o Simplification of programming.

  • 8/4/2019 Computo paralelo

    38/173

    Message Passing.

    Example: IBM SP-2, built out of RS/6000 workstations.

  • 8/4/2019 Computo paralelo

    39/173

    Message Passing.

    Example: Intel Paragon.

  • 8/4/2019 Computo paralelo

    40/173

    The convergence of the architectures.

    The evolution and role of software have blurred the boundary between shared memory and message passing.
      o send/recv supports shared memory via buffers.
      o A global address space can be built on top of message passing.

    The hardware organization is also converging.
      o Greater integration for message passing (lower latency, higher bandwidth).
      o At a low level, some shared-memory systems implement message passing in hardware.

    Different programming models, but these too are converging.

  • 8/4/2019 Computo paralelo

    41/173

    Interconnection of parallel systems

    The mission of the network in a parallel architecture is to transfer information from any source to any destination, minimizing latency, at a proportionate cost.

    The network is made up of:
      o nodes;
      o switches;
      o links.

    The network is characterized by its:
      o topology: the structure of the physical interconnection;
      o routing: which determines the routes that messages may or must follow in the network graph;
      o switching strategy: circuit switching or packet switching;
      o flow control: mechanisms for organizing the traffic.

  • 8/4/2019 Computo paralelo

    42/173

    Interconnection of parallel systems

    Classification of networks by topology.

    Static:

      o static point-to-point direct connections between nodes;
      o tight coupling between the network interface and the node;
      o the vertices of the network graph are nodes or switches.

    They are further classified as:

    symmetric: ring, hypercube, torus;
    non-symmetric: bus, tree, mesh.

  • 8/4/2019 Computo paralelo

    43/173

    Interconnection of parallel systems

    Classification of networks by topology.

    Dynamic:
      o the switches can dynamically change which nodes they interconnect.

    They are further classified as:
    single-stage;
    multistage:
      blocking (baseline, butterfly, shuffle);
      rearrangeable (Beneš);
      non-blocking (Clos).

  • 8/4/2019 Computo paralelo

    44/173

    Interconnection of parallel systems

    Characteristic parameters of a network:

    Network size: the number of nodes it contains.

    Degree of a node: the number of links incident on the node.

    Network diameter: the longest shortest path that can be found between any two nodes of the network.

    Symmetry: a network is symmetric if all nodes are indistinguishable from the point of view of communication.

  • 8/4/2019 Computo paralelo

    45/173

    Static networks

  • 8/4/2019 Computo paralelo

    46/173

    Static networks

  • 8/4/2019 Computo paralelo

    47/173

    Static networks
    Cycle-connected 3-D hypercube

  • 8/4/2019 Computo paralelo

    48/173

    Static networks
    Example of the connections in a 3-dimensional hypercube

    Connection of nodes that differ in the least significant bit

    Connection of nodes that differ in the second bit

    Connection of nodes that differ in the most significant bit

  • 8/4/2019 Computo paralelo

    49/173

    Dynamic networks

    Dynamic networks are networks whose configuration can be modified.

    There are two types:
      o single-stage;
      o multistage.

    Single-stage networks make connections between processing elements in a single stage.
      o It may not be possible to reach every element from every other one, so it may be necessary to recirculate the information (=> recirculating networks).

    Multistage networks make connections between processing elements in more than one stage.

  • 8/4/2019 Computo paralelo

    50/173

    Dynamic networks

    Single-stage interconnection networks

  • 8/4/2019 Computo paralelo

    51/173

    Dynamic networks

    Crossbar network: allows any connection.

  • 8/4/2019 Computo paralelo

    52/173

    Dynamic networks
    (Multistage) interconnection networks

    Switching boxes

    The four possible configurations of a 2-input switching box.

  • 8/4/2019 Computo paralelo

    53/173

    Blocking dynamic networks

    Blocking multistage networks.

      o They are characterized by the fact that it is not always possible to establish a new connection between a free source/destination pair, because of conflicts with connections already in progress.

      o Generally there is a single possible path between each source/destination pair.

  • 8/4/2019 Computo paralelo

    54/173

    Blocking dynamic networks

    Baseline network:

  • 8/4/2019 Computo paralelo

    55/173

    Blocking dynamic networks

    Butterfly network:

  • 8/4/2019 Computo paralelo

    56/173

    Blocking dynamic networks

    Perfect shuffle network:

  • 8/4/2019 Computo paralelo

    57/173

    Rearrangeable dynamic networks

    Rearrangeable multistage networks.

      o They are characterized by the fact that it is always possible to establish a new connection between a free source/destination pair, even when connections are in progress, but a change in the path used by some of them may become necessary (rearrangement).

      o Interesting for array processors, where all the interconnection requests are known simultaneously.

  • 8/4/2019 Computo paralelo

    58/173

    Rearrangeable dynamic networks

    Beneš network:

  • 8/4/2019 Computo paralelo

    59/173

    Rearrangeable dynamic networks

    The Beneš network can be built recursively:

  • 8/4/2019 Computo paralelo

    60/173

    Non-blocking dynamic networks

    Non-blocking dynamic networks.
      o They are characterized by the fact that a new connection can always be established between a free source/destination pair, without restrictions.

      o They are analogous to crossbar switches, but may show higher latency because of the multiple stages.

  • 8/4/2019 Computo paralelo

    61/173

    Non-blocking dynamic networks

    Clos network:

  • 8/4/2019 Computo paralelo

    62/173

    Cache Coherence

    Common memory-hierarchy structures in multiprocessors:
    Shared cache.

    Bus-based shared memory.

    Interconnection through a network (dance-hall).

    Distributed memory.

  • 8/4/2019 Computo paralelo

    63/173

    Cache Coherence

    Shared cache
    Small number of processors (2-8).
    It was common in the mid-1980s for connecting a couple of processors on a board.
    A possible strategy in chip multiprocessors.

  • 8/4/2019 Computo paralelo

    64/173

    Cache Coherence

    Bus-based sharing.

    Widely used in small- and medium-scale multiprocessors (20-30 processors).

    The dominant form in current parallel machines.

    Modern microprocessors are equipped to support coherence protocols in this configuration.

  • 8/4/2019 Computo paralelo

    65/173

    Cache Coherence

    Dance-hall

    Easily scalable.

    Symmetric UMA structure.
    Memory too far away, especially in large systems.

  • 8/4/2019 Computo paralelo

    66/173

    Cache Coherence

    Distributed memory
    Especially attractive for scalable multiprocessors.
    Non-symmetric NUMA structure.
    Fast local accesses.

  • 8/4/2019 Computo paralelo

    67/173

    Parallel computer architectures.

    Types of processor organization: there are 7 important processor organizations for connecting the processors of a parallel computer.

    Mesh Networks.
    Binary Tree Networks.
    Hypertree Networks.
    Pyramid Networks.
    Butterfly Networks.
    Hypercube Networks.
    Cube-Connected Cycles Networks.

  • 8/4/2019 Computo paralelo

    68/173

    The processor organization is also known as the topology, and based on it there are criteria for evaluating which model is more suitable than another. The applications can also determine whether a model is suitable or not.

    Criteria for evaluating models:

    Diameter.
    Bisection width.
    Number of edges per node.
    Maximum edge length.
    Degree of an architecture.

  • 8/4/2019 Computo paralelo

    69/173

    Mesh Networks.

    Nodes are arranged into a q-dimensional lattice.
    Communication is allowed only between neighboring nodes.
    Two-dimensional meshes:
      o Mesh with no wrap-around connections.
      o Mesh with wrap-around connections between processors in the same row or column.
      o Mesh with wrap-around connections between processors in adjacent rows or columns.
      o Wrap-around connections can connect processors in the same row or column, or in adjacent rows or columns.

  • 8/4/2019 Computo paralelo

    70/173

    Evaluation of the Mesh:
    Interior nodes communicate with 2q other processors.
    The diameter of a q-dimensional mesh with k^q nodes is q(k - 1).
    The bisection width of a q-dimensional mesh with k^q nodes is k^(q-1).
    The maximum number of edges per node is 2q.
    The maximum edge length is a constant, independent of the number of nodes, for two- and three-dimensional meshes.
    The two-dimensional mesh has been a popular topology for processor arrays:

      o Goodyear Aerospace's MPP
      o AMT DAP
      o MasPar's MP
      o The Intel Paragon XP/S multicomputer connects processors with a two-dimensional mesh.
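    As a quick worked example of these formulas (not in the original slide), for a 2-D mesh with q = 2 and k = 16:

    k^q = 256 \text{ nodes}, \qquad \text{diameter } q(k-1) = 30, \qquad \text{bisection width } k^{q-1} = 16, \qquad \text{edges per node } \le 2q = 4.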

  • 8/4/2019 Computo paralelo

    71/173

  • 8/4/2019 Computo paralelo

    72/173

  • 8/4/2019 Computo paralelo

    73/173

    Hypertree Networks
    An approach to building a network with the low diameter of a binary tree but with an improved bisection width.
    The easiest way to think of a hypertree network of degree k and depth d is to consider the network from two different angles.

    From the front, a hypertree network of degree k and depth d looks like a complete k-ary tree of height d.

    From the side, the same hypertree network looks like an upside-down binary tree of height d.
    Joining the front and side views yields the complete network.

  • 8/4/2019 Computo paralelo

    74/173

    Hypertree evaluation

    A 4-ary hypertree with depth d has

    4^d leaves and 2^d (2^(d+1) - 1) nodes.
    Its diameter is 2d.
    Its bisection width is 2^(d+1).
    The number of edges per node is never more than six.
    The maximum edge length is an increasing function of the problem size.

  • 8/4/2019 Computo paralelo

    75/173

    Hypertree network of degree 4 and depth 2: (a) front view, (b) side view, (c) complete network.
    The Connection Machine CM-5 multicomputer is a 4-ary hypertree.

  • 8/4/2019 Computo paralelo

    76/173

    Pyramid Networks

    An attempt to combine the advantages of mesh networks and tree networks.

    A pyramid network of size k^2 is a complete 4-ary rooted tree of height log2 k, augmented with additional interprocessor links so that the processors in every tree level form a 2-D mesh network.
    A pyramid of size k^2 has at its base a 2-D mesh network containing k^2 processors.

    The total number of processors in a pyramid of size k^2 is (4/3)k^2 - (1/3).
    The levels of the pyramid are numbered in ascending order: the base has level number 0, and the single processor at the apex of the pyramid has level number log2 k.

  • 8/4/2019 Computo paralelo

    77/173

    Pyramid network of size 16.

    Every interior processor is connected to nine other processors: one parent, four mesh neighbors, and four children.

  • 8/4/2019 Computo paralelo

    78/173

  • 8/4/2019 Computo paralelo

    79/173

    Pyramid evaluation

    The advantage of the pyramid over the 2-D mesh is that the pyramid reduces the diameter of the network.

    When a message must travel from one side of the mesh to the other, fewer link traversals are required if the message travels up and down the tree rather than across the mesh.

    The diameter of a pyramid of size k^2 is 2 log k.

    The addition of tree links does not give a significantly higher bisection width than a 2-D mesh.
    The bisection width of a pyramid of size k^2 is 2k.
    The maximum number of links per node is no greater than nine, regardless of the size of the network.
    Unlike a 2-D mesh, the length of the longest edge is an increasing function of the network size.

  • 8/4/2019 Computo paralelo

    80/173

    Butterfly Network

    Consists of (k+1)2^k nodes divided into k+1 rows, or ranks, each containing n = 2^k nodes.

    The ranks are labeled 0 through k.

    Ranks 0 and k are sometimes combined, giving each node four connections to other nodes.

  • 8/4/2019 Computo paralelo

    81/173

  • 8/4/2019 Computo paralelo

    82/173

    Node connection

    Let node(i, j) refer to the jth node on the ith rank, where 0 ≤ i ≤ k and 0 ≤ j < 2^k.

  • 8/4/2019 Computo paralelo

    83/173

    Butterfly evaluation

    As the rank numbers decrease, the widths of the wings of the butterflies increase exponentially.

    The length of the longest network edge increases as the number of network nodes increases.

    The diameter of a butterfly network with (k + 1)2^k nodes is 2k.
    The bisection width is 2^(k-1).

    A butterfly network serves to route data from non-local memory to processors on the BBN TC2000 multiprocessor.

  • 8/4/2019 Computo paralelo

    84/173

    Hypercube

    A cube-connected network, also called a binary n-cube network, is a butterfly with its columns collapsed into single nodes.

    It consists of 2^k nodes forming a k-dimensional hypercube.
    The nodes are labeled 0, 1, ..., 2^k - 1; two nodes are adjacent if their labels differ in exactly one bit position.
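    A small illustrative sketch of this adjacency rule in C (not from the slides): two labels are neighbors exactly when their XOR is a power of two, and a node's k neighbors are obtained by flipping each bit in turn.

    /* Hypercube adjacency sketch.  Labels range over 0 .. 2^k - 1. */
    #include <stdio.h>

    /* Two nodes are adjacent iff their labels differ in exactly one bit,
       i.e. a XOR b is a power of two. */
    int adjacent(unsigned a, unsigned b)
    {
        unsigned d = a ^ b;
        return d != 0 && (d & (d - 1)) == 0;
    }

    int main(void)
    {
        unsigned k = 4, node = 5;               /* 4-dimensional hypercube, node 0101 */
        for (unsigned bit = 0; bit < k; bit++)  /* the k neighbors: flip each bit */
            printf("neighbor across dimension %u: %u\n", bit, node ^ (1u << bit));
        printf("adjacent(5, 7) = %d\n", adjacent(5, 7));   /* differ in one bit -> 1 */
        return 0;
    }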

  • 8/4/2019 Computo paralelo

    85/173

    A four-dimensional hypercube.

  • 8/4/2019 Computo paralelo

    86/173

    Hypercube evaluation

    The diameter of a hypercube with 2^k nodes is k.
    The bisection width of that size network is 2^(k-1).
    The hypercube organization has low diameter and high bisection width, at the expense of the number of edges per node and the length of the longest edge.
    The number of edges per node is k - the logarithm of the number of nodes in the network.

    The length of the longest edge in a hypercube network increases as the number of nodes in the network increases.

  • 8/4/2019 Computo paralelo

    87/173

    Cube-Connected Cycles Networks

    The cube-connected cycles network is a k-dimensional hypercube whose 2^k "vertices" are actually cycles of k nodes.
    For each dimension, every cycle has a node connected to a node in the neighboring cycle in that dimension.

  • 8/4/2019 Computo paralelo

    88/173

    24-node cube-connected cycles network.

  • 8/4/2019 Computo paralelo

    89/173

    Cube-connected cycles evaluation

    Node(i, j) is connected to node(i, m)

    if and only if m is the result of inverting the ith most significant bit of the binary representation of j.

    Compared to the hypercube,

    the cube-connected cycles processor organization has the advantage that the number of edges per node is three - a constant independent of network size.

  • 8/4/2019 Computo paralelo

    90/173

    Disadvantages

    The diameter is twice that of a hypercube.
    The bisection width is lower.

    Given a cube-connected cycles network of size k2^k,
      o its diameter is 2k;
      o its bisection width is 2^(k-1).

  • 8/4/2019 Computo paralelo

    91/173

    CHARACTERISTICS OF VARIOUS PROCESSOR ORGANIZATIONS

  • 8/4/2019 Computo paralelo

    92/173

    Dataflow and implicit parallelism - Von Neumann vs. parallel

    Parallel random access machine (PRAM)
    Provides a mental break from the von Neumann model and sequential algorithms.
    The PRAM (pronounced "pea ram") model of parallel computation allows parallel-algorithm designers to treat processing power as an unlimited resource.
    It is unrealistically simple; it ignores the complexity of interprocessor communication.

    The designer of PRAM algorithms can focus on the parallelism inherent in a particular computation.
    Cost-optimal PRAM solutions exist, meaning that the total number of operations performed by the PRAM algorithm is of the same complexity class as an optimal sequential algorithm.

    Cost-optimal PRAM algorithms can serve as a foundation for efficient algorithms on real parallel computers.

  • 8/4/2019 Computo paralelo

    93/173

    RAM and PRAM computational models

    RAM, a model of serial computation
    The random access machine (RAM) is a model of a one-address computer.

    It consists of:
      memory,
      a read-only input tape,
      a write-only output tape,
      a program.

  • 8/4/2019 Computo paralelo

    94/173

  • 8/4/2019 Computo paralelo

    95/173

    RAM program
    The program is not stored in memory and cannot be modified.

    The input tape contains a sequence of integers. Every time an input value is read, the input head advances one square. The output head advances after every write.
    Memory consists of an unbounded sequence of registers r0, r1, r2, .... Each register can hold a single integer. Register r0 is the accumulator, where computations are performed.

    The exact instructions are not important, as long as they resemble the instructions found on an actual computer:

    load, store, read, write, add, subtract, multiply, divide, test, jump, and halt.

  • 8/4/2019 Computo paralelo

    96/173

    RAM Time Complexity

    The worst-case time complexity of a RAM program is the function f(n), the maximum time taken by the program to execute over all inputs of size n.

    The expected time complexity of a RAM program is the average, over all inputs of size n, of the execution times.

    Analogous definitions hold for worst-case space complexity and expected space complexity.

    There are two ways of measuring time and space on the RAM model:
      the uniform cost criterion and
      the logarithmic cost criterion.

  • 8/4/2019 Computo paralelo

    97/173

    Cost criterion

    The uniform cost criterion says that
      each RAM instruction requires one time unit to execute and
      every register requires one unit of space.

    The logarithmic cost criterion takes into account that an actual word ofmemory has a limited storage capacity.

    The uniform cost criterion is appropriate if the values manipulated by theprogram always fit into one computer word.

  • 8/4/2019 Computo paralelo

    98/173

    The PRAM model of parallel computation

    A PRAM consists of
      a control unit,
      global memory,
      an unbounded set of processors, each with its own private memory.
    Active processors execute identical instructions.
    Every processor has a unique index.

    The value of a processor's index can be used to
      enable or disable the processor, or
      influence which memory location it accesses.

  • 8/4/2019 Computo paralelo

    99/173

  • 8/4/2019 Computo paralelo

    100/173

    PRAM computation
    A PRAM computation begins with

      the input stored in global memory and
      a single active processing element.

    During each step, an active, enabled processor may
      read a value from a single private or global memory location,
      perform a single RAM operation,
      write into one local or global memory location, and
      activate another processor.

    The processors are synchronized: all active, enabled processors must execute the same instruction, on different memory locations.

    The computation terminates when the last processor halts.

  • 8/4/2019 Computo paralelo

    101/173

    PRAM models differ in how they handle read or write conflicts, i.e., when two or more processors attempt to read from, or write to, the same global memory location.

    1. EREW (Exclusive Read Exclusive Write): read or write conflicts are not allowed.

    2. CREW (Concurrent Read Exclusive Write):

    Concurrent reading is allowed; i.e., multiple processors may read from the same global memory location during the same instruction step. Write conflicts are not allowed. (This is the default PRAM model.)

    3. CRCW (Concurrent Read Concurrent Write): concurrent reading and concurrent writing are allowed. A variety of CRCW models exist, with different policies for handling concurrent writes to the same global address.

  • 8/4/2019 Computo paralelo

    102/173

    Types of CRCW PRAM
    Three different models:

    Common, Arbitrary, Priority

    Various CRCW PRAMs
    1. Common. All processors concurrently writing into the same global address must be writing the same value.

    2. Arbitrary. If multiple processors concurrently write to the same global address, one of the competing processors is arbitrarily chosen as the "winner," and its value is written into the register.

    3. Priority. If multiple processors concurrently write to the same global address, the processor with the lowest index succeeds in writing its value into the memory location.

  • 8/4/2019 Computo paralelo

    103/173

    Strengths of PRAM models

    The EREW PRAM model is the weakest.
    Clearly a CREW PRAM can execute any EREW PRAM algorithm in the same amount of time; the concurrent read facility is simply not used.

    A CRCW PRAM can execute any CREW PRAM algorithm in the same amount of time.

    The PRIORITY PRAM model is the strongest.

    Any algorithm designed for the COMMON PRAM model will execute with the same complexity on the ARBITRARY PRAM and PRIORITY PRAM models.

  • 8/4/2019 Computo paralelo

    104/173

    Strengths of PRAM models

    If all processors writing to the same location write the same value, choosing an arbitrary processor would cause the same result.

    If an algorithm executes correctly when an arbitrary processor is chosen as the "winner," the processor with the lowest index is as reasonable an alternative as any other.

    Any algorithm designed for the ARBITRARY PRAM model will execute with the same time complexity on the PRIORITY PRAM model.

    Because the PRIORITY PRAM model is stronger than the EREW PRAM model, an algorithm that solves a problem on the EREW PRAM can have higher time complexity than an algorithm solving the same problem on the PRIORITY PRAM model.

  • 8/4/2019 Computo paralelo

    105/173

    Increase in parallel time complexity

    The increase in parallel time complexity can occur when moving from the PRIORITY PRAM model to the EREW PRAM model.

    Lemma. A p-processor EREW PRAM can sort a p-element array stored in global memory in Θ(log p) time.

    Theorem. A p-processor PRIORITY PRAM can be simulated by a p-processor EREW PRAM with the time complexity increased by a factor of Θ(log p).

  • 8/4/2019 Computo paralelo

    106/173

    Simulation of a PRIORITY PRAM by an EREW PRAM

    Assume the PRIORITY PRAM algorithm uses
      processors P1, P2, ..., Pp and
      global memory locations M1, M2, ..., Mm.

    The EREW PRAM uses auxiliary global memory locations T1, T2, ..., Tp and S1, S2, ..., Sp to simulate each read or write step of the PRIORITY PRAM.
    When processor Pi in the PRIORITY PRAM algorithm accesses memory location Mj, processor Pi in the EREW PRAM algorithm writes the ordered pair (j, i) in memory location Ti.

  • 8/4/2019 Computo paralelo

    107/173

    Then the EREW PRAM sorts the elements of T. This step takes Θ(log p) time (Lemma). By reading adjacent entries in the sorted array, the highest-priority processor accessing any particular location can be found in constant time.

    Processor P1 reads memory location T1, retrieves the ordered pair (i1, j1), and writes a 1 into global memory location Sj1.

    The remaining processors Pk, where 2 ≤ k ≤ p,

  • 8/4/2019 Computo paralelo

    108/173

    A concurrent write operation

    A concurrent write operation, which takes constant time on a p-processor PRIORITY PRAM, can be simulated in Θ(log p) time on a p-processor EREW PRAM.

    (a) Concurrent write on the PRIORITY PRAM model.

    Processors P1, P2, and P4 attempt to write values to memory location M3. Processor P1 wins.
    Processors P3 and P5 attempt to write values to memory location M7. Processor P3 wins.

  • 8/4/2019 Computo paralelo

    109/173

  • 8/4/2019 Computo paralelo

    110/173

    Concurrent write on the EREW PRAM model

    Each processor writes an (address, processor number) pair to a unique element of T.

    The processors sort T in Θ(log p) time.
    In constant time the processors can set to 1 those elements of S corresponding to the winning processors.
    The winning processors write their values.

    For a write instruction, the highest-priority processor accessing each memory location writes its value.

    For a read instruction, the highest-priority processor accessing each memory location reads that location's value, then duplicates the value in Θ(log p) time so that there is a copy in a unique memory location for every processor to access.

  • 8/4/2019 Computo paralelo

    111/173

    PRAM ALGORITHMS

    If a PRAM algorithm has lower time complexity than an optimal RAM algorithm, it is because parallelism has been used.

    PRAM algorithms begin with only a single processor active and have two phases:

      o in the first phase a sufficient number of processors are activated;
      o these activated processors then perform the computation in parallel.

    Given a single active processor, it is easy to see that ⌈log p⌉ activation steps are both necessary and sufficient for p processors to become active.

    Processor activation

    Exactly ⌈log p⌉ processor activation steps are necessary and sufficient to change from 1 active processor to p active processors.

  • 8/4/2019 Computo paralelo

    112/173

  • 8/4/2019 Computo paralelo

    113/173

    The Binary Tree Model

    The binary tree is one of the most important paradigms of parallel computing.
    In some algorithms data flows top-down, from the root of the tree to the leaves. Broadcast and divide-and-conquer algorithms both fit this model.

    In broadcast algorithms the root sends the same data to every leaf.

    In divide-and-conquer algorithms the tree represents the recursive subdivision of problems into subproblems.
    In other algorithms data flows bottom-up, from the leaves of the tree to the root.

    These are called fan-in or reduction operations.

  • 8/4/2019 Computo paralelo

    114/173

    Parallel Reduction

    Given
      a set of n values a1, a2, ..., an and
      an associative binary operator ⊕,
    reduction is the process of computing a1 ⊕ a2 ⊕ ... ⊕ an.

    Parallel summation is an example of a reduction operation.

  • 8/4/2019 Computo paralelo

    115/173

    We represent each tree node with an element in an array. The mapping from the tree to the array is straightforward.

  • 8/4/2019 Computo paralelo

    116/173

    Sum of n values

  • 8/4/2019 Computo paralelo

    117/173

    Complexity:

    The spawn routine requires ⌈log(n/2)⌉ doubling steps.

    The sequential for loop executes ⌈log n⌉ times, and each iteration has constant time complexity.
    Hence the overall time complexity of the algorithm is Θ(log p), given n/2 processors.
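    The slides' PRAM pseudocode did not survive the transcript; as an illustration only, a C sketch of the same tree-shaped summation, sequentially simulating the PRAM steps: in each of the ⌈log n⌉ steps, position i adds in position i + stride, halving the number of active positions.

    /* Tree-reduction sum sketch (illustrative). */
    #include <stdio.h>

    int tree_sum(int a[], int n)
    {
        /* After step s, a[i] holds the sum of a block of about 2^(s+1) inputs. */
        for (int stride = 1; stride < n; stride *= 2)          /* ~log2(n) steps      */
            for (int i = 0; i + stride < n; i += 2 * stride)   /* the "processors"    */
                a[i] += a[i + stride];
        return a[0];                                           /* total ends up here  */
    }

    int main(void)
    {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("sum = %d\n", tree_sum(a, 8));   /* prints 36 */
        return 0;
    }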

  • 8/4/2019 Computo paralelo

    118/173

    Processor organization - processor organizations represented by graphs

    A processor organization can be represented by a graph:
      nodes (vertices) represent processors;
      edges represent communication paths between pairs of processors.

    Processor organizations are evaluated according to criteria that help us understand their effectiveness in implementing efficient parallel algorithms on real hardware. These criteria are:

    Diameter.
    Bisection width.
    Number of edges per node.
    Maximum edge length.

  • 8/4/2019 Computo paralelo

    119/173

    Diameter

    The diameter of a network is the largest distance between two nodes.
    Low diameter is better, because the diameter puts a lower bound on the complexity of parallel algorithms requiring communication between arbitrary pairs of nodes.

  • 8/4/2019 Computo paralelo

    120/173

    Network bisection width

    The bisection width of a network is the minimum number of edges that must be removed in order to divide the network into two halves (within one).

    High bisection width is better, because in algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of the parallel algorithm.

  • 8/4/2019 Computo paralelo

    121/173

    Number of edges per node

    It is best if the number of edges per node is a constant independent of the network size: the processor organization then scales more easily to systems with large numbers of nodes.

    Maximum edge length

    For scalability reasons it is best if the nodes and edges of the network can be laid out in three-dimensional space so that the maximum edge length is a constant independent of the network size.

  • 8/4/2019 Computo paralelo

    122/173

    Introduction to the complexity of parallel algorithms - running times

    Types of analysis for algorithms

    1. Worst case (the usual one)
    T(n) = maximum time needed for a problem of size n.

    2. Average case
    T(n) = expected time for a problem of size n. A statistical distribution must be established.

    3. Best case
    T(n) = minimum time for a problem of size n. This can be misleading because it does not always occur.

  • 8/4/2019 Computo paralelo

    123/173

    The worst-case complexity of a RAM program is the function f(n), the maximum time it takes the program to execute over all inputs of size n.

    The expected-time complexity of a RAM program is the average time it takes to execute the inputs of size n.

    There are two ways of measuring time in the RAM model:

    Uniform cost criterion: each RAM instruction requires one time unit to execute and each register requires one unit of memory space. It is appropriate if the values manipulated by the program fit within one computer word.

    Logarithmic cost criterion: it takes into account that an actual word of memory has a limited capacity.

  • 8/4/2019 Computo paralelo

    124/173

    Big O

    The notation O(g(x)) is usually used to refer to the functions bounded above by the function g(x).

    The asymptotically tight bound (Θ notation) is related to the asymptotic upper and lower bounds (O and Ω notation):

    An asymptotic upper bound is a function that serves as an upper bound of another function as the argument tends to infinity.

  • 8/4/2019 Computo paralelo

    125/173

    Usual orders for functions:

    The orders most used in algorithm analysis, in increasing order, are the following (where c represents a constant):

  • 8/4/2019 Computo paralelo

    126/173

    Depending on the order of the function, it is said to be efficient or not. For a problem of size n, the order of the function that solves it can range from optimal (best case) or efficient to inefficient (worst case).

    DESIGN OF PARALLEL PROGRAMS

  • 8/4/2019 Computo paralelo

    127/173

    DESIGN OF PARALLEL PROGRAMS

    Single Memory Systems
    The CHALLENGE/Onyx uses a high-speed system bus to connect all components of the system.

    Memory has these features:
    There is a single address map; that is, the same word of memory has the same address in every CPU.
    There is no time penalty for communication between processes, because every memory word is accessible in the same amount of time from any CPU.

    All peripherals are equally accessible from any process.
    Processes running in different CPUs can share memory and can update identical memory locations concurrently.

  • 8/4/2019 Computo paralelo

    128/173

    Processes running in different CPUs can share memory and can update theidentical memory locations concurrently.

    Processes map a single segment of memory into the virtual address spacesof two or more concurrent processes.

    Two processes can transfer data at memory speeds, one putting the data intoa mapped segment and the other process taking the data out.

    They can coordinate their access to the data using semaphores located in theshared segment.

  • 8/4/2019 Computo paralelo

    129/173

    MULTIPLE MEMORY SYSTEMS
    Multiple Memory Systems

    There is not a single address map. A word of memory in one node cannot be addressed at all from another node.

    There is a time penalty for some interprocess communication.
    Peripherals are accessible only in the node to which they are physically attached.

    The Message Passing Interface (MPI) is designed specifically for an application that executes concurrently in multiple nodes.

    Models of parallel execution

  • 8/4/2019 Computo paralelo

    130/173

    Models of parallel execution

    Two features of the models for parallel programming:

    Granularity: the relative size of the unit of computation that executes in parallel: a single statement, a function, or an entire process.

    Communication channel: the basic mechanism by which the independent, concurrent units of the program exchange data and synchronize their activity.

    Process-level parallelism

  • 8/4/2019 Computo paralelo

    131/173

    Process-level parallelism
    A UNIX process consists of

    an address space,
    a large set of process state values,
    one thread of execution.

    Interprocess communication (IPC) mechanisms can be used
      o to exchange data,
      o to coordinate the activities of multiple, asynchronous processes.

    In traditional UNIX practice, one process creates another with the system call fork(), which makes a duplicate of the calling process, after which the two copies execute in parallel.

    Typically the new process immediately uses the exec() function to load a new program.
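    A minimal illustrative sketch of this fork()/exec() pattern in C (not from the slides):

    /* fork()/exec() sketch: the parent duplicates itself, the child loads a new
       program, and the parent waits for it to finish. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();                          /* duplicate the calling process */
        if (pid == -1) {
            perror("fork");
            exit(EXIT_FAILURE);
        }
        if (pid == 0) {                              /* child copy */
            execlp("ls", "ls", "-l", (char *)NULL);  /* replace its image with a new program */
            perror("execlp");                        /* reached only if exec fails */
            _exit(127);
        }
        waitpid(pid, NULL, 0);                       /* parent waits for the child */
        return 0;
    }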

    Lightweight process

  • 8/4/2019 Computo paralelo

    132/173

    Lightweight processes

    A lightweight process shares some of its process state values with its parent process.

    It does not have its own address space; it continues to execute in the address space of the original process.

    A lightweight process differs from a thread in two significant ways:

    It has a full set of UNIX state values. Some of these, for example the table of open file descriptors, can be shared with the parent process, but in general a lightweight process carries most of the state information of a process.
    Dispatch of lightweight processes is done in the kernel, and has the same overhead as dispatching any process.
    The library support for statement-level parallelism is based on the use of lightweight processes.

    Process Creation

  • 8/4/2019 Computo paralelo

    133/173

    Process Creation
    The process that creates another is called the parent process. The processes it creates are child processes. The parent and its children together are a share group.

    The fork() function is the traditional UNIX way of creating a process. The new process is a duplicate of the parent process, running in a duplicate of the parent's address space.

    Both execute the identical program text.
    A parent process should not terminate while its child processes continue to run.

    Process Management

  • 8/4/2019 Computo paralelo

    134/173

    Process Management

    When the parent process has nothing to do after starting the child processes, it can
      loop on wait() until wait() reports that no more children exist;
      then exit.

    Sometimes it is necessary to handle child termination, and the parent cannot suspend.

    In this case the parent can treat the termination of a child process as an asynchronous event.

    Parallelism in Real-Time Applications

  • 8/4/2019 Computo paralelo

    135/173

    Parallelism in Real-Time Applications

    In real-time programs such as aircraft or vehicle simulators, separate processes are used to divide the work of the simulation and distribute it onto multiple CPUs.

    In these applications, IRIX facilities (see the REACT Real-Time Programmer's Guide) can be used to:

    reserve one or more CPUs of a multiprocessor for exclusive use by the application;
    isolate the reserved CPUs from all interrupts;
    assign specific processes to execute on specific, reserved CPUs.

    The Frame Scheduler

  • 8/4/2019 Computo paralelo

    136/173

    The Frame Scheduler
      seizes one or more CPUs of a multiprocessor,
      isolates them, and
      executes a specified set of processes on each CPU in strict rotation.

    The Frame Scheduler has

      much lower overhead than the normal IRIX scheduler, and
      features designed for real-time work, including detection of
        o overrun (when a scheduled process does not complete its work in the necessary time), and
        o underrun (when a scheduled process fails to execute in its turn).

    Thread-level parallelism

  • 8/4/2019 Computo paralelo

    137/173

    Thread-level parallelism
    A thread is an independent execution state within the context of a larger program; that is,

    a set of machine registers,
    a call stack,
    the ability to execute code.

    A program can create many threads to execute in the same address space.

    There are two main reasons for using threads: portability and performance.

    There are three key differences between a thread and a process:

  • 8/4/2019 Computo paralelo

    138/173

    There are three key differences between a thread and a process:

    A UNIX process has its own set of UNIX state information, for example its own effective user ID and set of open file descriptors. Threads exist within a process and do not have distinct copies of these state values; threads share the single state belonging to their process.

    Each UNIX process has a unique address space that is accessible only to that process. Threads within a process share the single address space belonging to their process.

    Processes are scheduled by the kernel. Threads are scheduled by code that operates in the user address space, without kernel assistance. Thread scheduling can be faster than process scheduling.

    Threads:

  • 8/4/2019 Computo paralelo

    139/173

    Threads:

    A thread takes relatively little time to create or destroy, as compared to creating a lightweight process.
    A thread shares all resources and attributes of a single process (except for the signal mask).

    If you want
      each executing entity to have its own set of file descriptors, or
      to make sure that one entity cannot modify data shared with another entity,

    you must use lightweight processes or normal processes.

    Threads cannot use these IPC mechanisms.

  • 8/4/2019 Computo paralelo

    140/173

    Threads can coordinate using these mechanisms:

    Unnamed semaphores, for general coordination and resource management.

    Message queues.
    Mutex objects, which allow threads to gain exclusive use of a shared variable.

    Condition variables, which allow a thread to wait when a controlling predicate is false.

    Semaphores, locks, and barriers to coordinate between multiple threads within a single program.

    Mutexes:

  • 8/4/2019 Computo paralelo

    141/173

    Mutexes:
    A mutex is a software object that stands for

    the right to modify some shared variable, or

    the right to execute a critical section of code.

    A mutex can be owned by only one thread at a time; other threads trying to acquire it wait.

    When a thread wants to modify a variable that it shares with other threads, or execute a critical section, the thread claims the associated mutex.

    This can cause the thread to wait until it can acquire the mutex.

    When the thread has finished using the shared variable or critical code, it releases the mutex.
    If two or more threads claim the mutex at once,

    one acquires the mutex and continues; the others are blocked until the mutex is released.

    A mutex has attributes that control its behavior.
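    A minimal POSIX-threads sketch (illustrative, not from the slides) of claiming and releasing a mutex around a shared counter:

    /* Mutex sketch: two threads increment a shared counter inside a critical section. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;                    /* the shared variable */

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);          /* claim the mutex (may wait) */
            counter++;                          /* critical section */
            pthread_mutex_unlock(&lock);        /* release it */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);     /* always 200000 */
        return 0;
    }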

    Condition Variables

  • 8/4/2019 Computo paralelo

    142/173

    Condition Variables

    A condition variable provides a way in which a thread can
      temporarily give up ownership of a mutex,
      wait for a condition to be true,
      then reclaim ownership of the mutex,
    all in a single operation.

    Preparing Condition Variables

    Condition variables are supplied with
      a mechanism of attribute objects and
      static and dynamic initializers.
    A condition variable must be initialized before use.

    Using Condition Variables

  • 8/4/2019 Computo paralelo

    143/173

    Using Condition Variables

    A condition variable is a software object that represents a test of a Boolean

    condition.

    Typically the condition changes because of a software event such as"other thread has supplied needed data."

    A thread that wants to wait for that event claims the condition variable,which causes it to wait.

    The thread that recognizes the event signals the condition variable,releasing one or all threads that are waiting for the event.

  • 8/4/2019 Computo paralelo

    144/173

    A thread holds a mutex that represents a shared resource. While holding the mutex, the thread finds that the shared resource is not complete or not ready.

    The thread needs to do three things:

    Give up the mutex so that some other thread can renew the shared resource.

    Wait for the event that "the resource is now ready for use."
    Re-acquire the mutex for the shared resource.

    These three actions are combined into one using a condition variable.

    When the event is signalled (or the time limit expires), the mutex is reacquired.
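    A minimal illustrative sketch of that wait/signal pattern with POSIX threads; the variable `ready` stands for the hypothetical "resource is now ready" predicate:

    /* Condition-variable sketch: give up the mutex, wait for "ready", reclaim it. */
    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int ready = 0;                       /* the controlling predicate */

    void consumer(void)
    {
        pthread_mutex_lock(&lock);
        while (!ready)                          /* re-test: wakeups may be spurious */
            pthread_cond_wait(&cond, &lock);    /* atomically releases, then re-acquires, the mutex */
        /* ... use the shared resource here ... */
        pthread_mutex_unlock(&lock);
    }

    void producer(void)
    {
        pthread_mutex_lock(&lock);
        ready = 1;                              /* renew the shared resource */
        pthread_cond_signal(&cond);             /* release one waiting thread */
        pthread_mutex_unlock(&lock);
    }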

    Statement-level parallelism

  • 8/4/2019 Computo paralelo

    145/173

    Statement-Level Parallelism.

    The finest level of granularity.Statement-level parallel support is based on using common variables in memory,and so it can be used only within the bounds of a single-memory system.

    The method of creating an optimized parallel program is as follows (an illustrative sketch appears after the list):

    Write a complete application that runs on a single processor.

    Completely debug and verify the correctness of the program in serial execution.

    Apply the source analyzer.

    Add assertions to the source program. These are not explicit commands to parallelize, but high-level statements that describe the program's use of data.

    Run the program on a single-memory multiprocessor.
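    The IRIX source analyzer and its assertion syntax are not reproduced here; as a stand-in, the sketch below uses an OpenMP directive, which expresses the same statement-level idea, independent loop iterations divided among the processors of a single-memory system, in a widely available form (compile with an OpenMP-capable compiler, e.g. cc -fopenmp).

        #include <stdio.h>

        #define N 1000000

        static double a[N], b[N], c[N];

        int main(void)
        {
            for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

            /* The iterations are independent, so the compiler may split the
               loop across the processors of a single-memory multiprocessor. */
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];

            printf("c[N-1] = %f\n", c[N - 1]);
            return 0;
        }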

    Models of distributed computing


    The MPI model (Message Passing Interface)

    MPI is a standard programming interface for the construction of a portable, parallel application in Fortran 77 or in C, especially when the application can be decomposed into a fixed number of processes operating in a fixed topology (for example, a pipeline, grid, or tree).

    The PVM model (Parallel Virtual Machine)

    PVM is an integrated set of software tools and libraries that emulates a general-purpose, flexible, heterogeneous, concurrent computing framework on interconnected computers of varied architecture. Using PVM, you can create a parallel application that executes as a set of concurrent processes on a set of computers that can include uniprocessors, multiprocessors, and nodes of Array systems.

    Each is a formal, abstract model for distributing a computation across the nodes of a multiple-memory system, without having to reflect the system configuration in the source code.

    Processes and threads allow you to execute in parallel within a single system memory. When the system memory is distributed among multiple independent machines, your program must be built around a message-passing model.

    In a message-passing model, your application consists of multiple, independent processes, each with its own address space, possibly running on many different computers.

    Each process shares data and coordinates with the others by passing messages. IRIX supports two message-passing libraries: the Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM).
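    A minimal sketch of the message-passing style with MPI follows: a fixed set of processes, each with its own address space, coordinating only by sending and receiving messages. The values exchanged are arbitrary; typically this is compiled with mpicc and launched with something like mpirun -np 4.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            if (rank == 0) {
                /* Process 0 sends one integer to every other process. */
                for (int dest = 1; dest < size; dest++) {
                    int value = 100 + dest;
                    MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
                }
            } else {
                int value;
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("process %d received %d\n", rank, value);
            }

            MPI_Finalize();
            return 0;
        }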

    Choosing Between MPI and PVM


    The MPI interface is the primary and preferred model for distributed applications.

    In many ways, MPI and PVM are similar:

    Each is designed, specified, and implemented by third parties that have no direct interest in selling hardware.

    Support for each is available over the Internet at low or no cost.

    Each defines portable, high-level functions that are used by a group of processes to make contact and exchange data without having to be aware of the communication medium.

    Each supports C and Fortran 77.

    Each provides for automatic conversion between different representations of the same kind of data, so that processes can be distributed over a heterogeneous computer network.

    MPI


    The primary reason MPI is preferred is performance. The design of MPI is such that a highly optimized implementation can be created for a homogeneous environment. MPI applications take advantage of this to exchange data with small latencies and high data rates.

    Another difference between MPI and PVM is in the support for the "topology" (the interconnect pattern: grid, torus, or tree) of the communicating processes.

    In MPI, the group size and topology are fixed when the group is created. This permits low-overhead group operations. In PVM, group composition is dynamic, which causes more overhead in common group-related operations.
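    As an illustration of a fixed topology, the sketch below asks MPI to arrange the processes of MPI_COMM_WORLD into a 2-D grid once, when the communicator is created; after that the group size and layout do not change. The 2-D shape is an arbitrary choice for the example.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            int size;
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Let MPI pick a balanced 2-D decomposition of 'size' processes. */
            int dims[2] = {0, 0};
            MPI_Dims_create(size, 2, dims);

            /* The grid communicator is created once; its shape is then fixed. */
            int periods[2] = {0, 0};            /* non-periodic grid */
            MPI_Comm grid;
            MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

            int rank, coords[2];
            MPI_Comm_rank(grid, &rank);
            MPI_Cart_coords(grid, rank, 2, coords);
            printf("rank %d is at position (%d,%d) of a %dx%d grid\n",
                   rank, coords[0], coords[1], dims[0], dims[1]);

            MPI_Comm_free(&grid);
            MPI_Finalize();
            return 0;
        }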

    Converting a PVM program into an MPI program


    To a large extent, the library calls of MPI and PVM provide similar functionality.

    Some PVM calls do not have a counterpart in MPI, and vice versa.

    The semantics of some of the equivalent calls are inherently different for the two libraries.

    The process of converting a PVM program can be complicated, depending on the particular PVM calls and how they are used.

    PVM includes a console, which is useful for monitoring and controlling the states of the machines in the virtual machine and the state of execution of a PVM job.

    The MPI standard does not provide mechanisms for specifying the initial allocation of processes to an MPI computation and their binding to physical processors.

    The differences between PVM and MPI


    The chief differences between the current versions of the PVM and MPI libraries are as follows:

    PVM supports dynamic creation of tasks, whereas MPI does not.

    PVM supports dynamic process groups, that is, groups whose membership can change dynamically at any time during a computation. MPI does not support dynamic process groups.

    The chief difference between PVM groups and MPI communicators is that any PVM task can join or leave a group independently, whereas in MPI all communicator operations are collective.

    A PVM task can add or delete a host from the virtual machine, thereby dynamically changing the number of machines a program runs on. This is not available in MPI.

    The differences between PVM and MPI (2)


    PVM provides two methods of signaling other PVM tasks:

    o sending a UNIX signal to another task,

    o notifying a task about an event by sending it a message with a user-specified tag that the application can check.

    These functions are not available in MPI.

    A task can leave and rejoin a PVM session as many times as it wants, whereas an MPI task must initialize and finalize exactly once.

    A PVM task can be registered by another task as responsible for adding new PVM hosts, as a PVM resource manager, or as responsible for starting new PVM tasks. These features are not available in MPI.

    A PVM task can multicast data to a set of tasks. As opposed to a broadcast, this multicast does not require the participating tasks to be members of a group. MPI does not have a routine to do multicasts.


    On the other hand, MPI provides several features that are not available in PVM, including:

    a variety of communication modes,
    communicators,
    derived data types,
    additional group management facilities,
    virtual process topologies,
    a larger set of collective communication calls.

    Programming models supported by PVM


    Pure SPMD Program

    In the pure SPMD program model, n instances of the same program are started as the n tasks of the parallel job, using the spawn command (or by hand at each of the n hosts simultaneously). No tasks are dynamically spawned by these tasks. This scenario is essentially the same as the current MPI one, where no tasks are dynamically spawned.

    General SPMD Model

    In this model, n instances of the same program are executed as the n tasks of the parallel job. One or more tasks are started at the beginning, and these dynamically spawn the remaining tasks in turn.
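    A sketch of this general SPMD pattern in PVM follows, assuming the program has been installed under the hypothetical executable name spmd_task so that pvm_spawn can find it: the hand-started instance spawns the remaining copies, and each spawned copy sends one integer back to it.

        #include <stdio.h>
        #include "pvm3.h"

        #define NTASKS 4

        int main(void)
        {
            int mytid  = pvm_mytid();       /* enroll this task in PVM */
            int parent = pvm_parent();      /* PvmNoParent if started by hand */

            if (parent == PvmNoParent) {
                /* First instance: dynamically spawn the remaining tasks. */
                int tids[NTASKS - 1];
                int started = pvm_spawn("spmd_task", NULL, PvmTaskDefault,
                                        "", NTASKS - 1, tids);   /* name is an assumption */

                for (int i = 0; i < started; i++) {
                    int value;
                    pvm_recv(-1, 1);            /* any sender, message tag 1 */
                    pvm_upkint(&value, 1, 1);
                    printf("received %d\n", value);
                }
            } else {
                /* Spawned instance: report back to the parent. */
                pvm_initsend(PvmDataDefault);
                pvm_pkint(&mytid, 1, 1);
                pvm_send(parent, 1);            /* message tag 1 */
            }

            pvm_exit();
            return 0;
        }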


    MPMD Model

    In the MPMD programming model, one or more distinct tasks (having different executables) are started by hand, and these tasks dynamically spawn other (possibly distinct) tasks.

    Controlling parallelism


    Segments

    Memory is manipulated through its memory locations, but its addressing can be:

    linear (direct);
    segmentation (data, code, stack, and extra segments), with global and local descriptors (which indicate locations shared by different segments);
    paging (linear address + offset + page table + page-table directory);
    a combination of segmentation and paging (virtual memory and protection levels: Kernel-0, System services-1, OS extensions-2, Applications-3). The kernel has the highest privilege and the greatest protection.

    Exceptions (instructions) and interrupts (devices).

    Processes


    MPI is the best example of controlling parallelism by means of processes.

    Semaphores
    Software objects used to coordinate access to counted resources (a sketch appears after this list).

    Readers and writers
    Software objects that set and interpret a signal in each process.

    Critical sections

    Bottlenecks.

    Synchronization
    A convenient way to synchronize parallel processes on multiprocessor systems is barriers.
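    As a sketch of coordinating access to a counted resource, the example below uses POSIX unnamed semaphores between threads; the number of slots and threads is arbitrary.

        #include <pthread.h>
        #include <semaphore.h>
        #include <stdio.h>

        #define NTHREADS 8
        #define NSLOTS   3            /* counted resource: only 3 slots exist */

        static sem_t slots;

        static void *worker(void *arg)
        {
            long id = (long)arg;
            sem_wait(&slots);                    /* block until a slot is free */
            printf("thread %ld is using a slot\n", id);
            sem_post(&slots);                    /* give the slot back */
            return NULL;
        }

        int main(void)
        {
            pthread_t t[NTHREADS];

            sem_init(&slots, 0, NSLOTS);         /* 0 = shared among threads only */

            for (long i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
            for (int i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);

            sem_destroy(&slots);
            return 0;
        }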

    Interprocess communication


    "Tipos de Comunicacin interprocesos

    Compartir memoria entre procesos (memoria compartida) Exclusin Mutua, incluye semforos (semaphores), candados (locks), y similares Sealizacin de eventos Colas de mensajes (Message Queues), describe dos variedades de cola de

    mensajes Bloqueo de archivo y grabado (File and Record Locking)

    Una comunicacin Interproceso (IPC) es Cualquier coordinacin de las acciones de mltiples procesos Enviar datos de un procesador a otro

    No se deben mezclar las implementaciones de un mecanismo dado en un programa

    sencillo, pueden dar resultados impredecibles.
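    As one concrete flavor of message queue, here is a sketch using POSIX message queues; IRIX also provides System V queues, so treat this as an illustration of the mechanism rather than the exact interface the text refers to. The queue name /demo_queue is invented for the example; on Linux this typically links with -lrt.

        #include <fcntl.h>
        #include <mqueue.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/types.h>

        int main(void)
        {
            struct mq_attr attr;
            attr.mq_flags   = 0;
            attr.mq_maxmsg  = 10;      /* queue capacity in messages */
            attr.mq_msgsize = 128;     /* maximum bytes per message */
            attr.mq_curmsgs = 0;

            /* A named queue that any cooperating process could open. */
            mqd_t q = mq_open("/demo_queue", O_CREAT | O_RDWR, 0600, &attr);
            if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

            const char *msg = "hello from one process";
            mq_send(q, msg, strlen(msg) + 1, 0);        /* priority 0 */

            char buf[128];
            unsigned prio;
            ssize_t n = mq_receive(q, buf, sizeof buf, &prio);
            if (n >= 0)
                printf("received: %s\n", buf);

            mq_close(q);
            mq_unlink("/demo_queue");
            return 0;
        }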


    TRENDS IN PARALLEL COMPUTING


    THE GRID PROJECT

    Grid.org is a single destination site for large-scale, non-profit research projects of global significance. With the participation of over 3 million devices worldwide, grid.org projects like Cancer Research, Anthrax Research, Smallpox Research and the new Human Proteome Folding Project (running in conjunction with IBM's new World Community Grid) have achieved record levels of processing speed and success.

    Grid.org projects are powered by United Devices' Grid MP technology, the leading solution for commercial enterprise grid deployments.

    The basics


    Grid computing is a form of distributed computing that involves coordinating and sharing computing, application, data, storage, or network resources across dynamic and geographically dispersed organizations. Grid technologies promise to change the way organizations tackle complex computational problems. However, the vision of large-scale resource sharing is not yet a reality in many areas: grid computing is an evolving area of computing, where standards and technology are still being developed to enable this new paradigm.

    Why is it important?


    Time and money. Organizations that depend on access to computational power to advance their business objectives often sacrifice or scale back new projects, design ideas, or innovations due to sheer lack of computational bandwidth. Project demands simply outstrip computational power, even if an organization has significant investments in dedicated computing resources.

    Even given the potential financial rewards from additional computational access, many enterprises struggle to balance the need for additional computing resources with the need to control costs. Upgrading and purchasing new hardware is a costly proposition, and with the rate of technology obsolescence, it is eventually a losing one. By better utilizing and distributing existing compute resources, grid computing will help alleviate this problem.

    Delivering grid benefits today

    Many companies want to take advantage of the cost and efficiency benefits that come from a grid infrastructure today, without being locked in to a system that will not grow with their needs.

    To provide customers the solution they need, United Devices tackled the complex security, scalability and unobtrusiveness issues required for a superior enterprise grid, while building towards the open standards of the GGF. By embracing these standards, United Devices lets its customers move toward compatibility with future grid technologies and adopt upcoming technologies as they are developed, while delivering the promises and benefits of the grid today.

    The Grid MP platform by United Devices works by amalgamating the underutilized IT resources on a corporate network into a powerful enterprise grid that can be shared by groups across the organization, even geographically disparate groups.

    The most common corporate technology asset, desktop PCs, are also the most underutilized, often only using 10% of their total compute power even when actively engaged in their primary business functions. By harnessing these plentiful underused computing assets and leveraging them for revenue-driving projects, the Grid MP platform provides immediate value for companies who want to move forward with their grid strategies without limiting any future grid developments.

    The benefits of building an enterprise grid with the Grid MP platform include:


    Lower Computing Costs

    On a price-to-performance basis, the Grid MP platform gets more work done with less administration and budget than dedicated hardware solutions. Depending on the size of your network, the price-for-performance ratio for computing power can literally improve by an order of magnitude.

    Faster Project Results

    The extra power generated by the Grid MP platform can directly impact an organization's ability to win in the marketplace by shortening product development cycles and accelerating research and development processes.

    Better Product Results

    Increased, affordable computing power means not having to ignore promising avenues or solutions because of a limited budget or schedule. The power created by the Grid MP platform can help to ensure a higher quality product by allowing higher-resolution testing and results, and can permit an organization to test more extensively prior to product release.

    Linux and Beowulf

    Linux is an open-source, freely available Unix kernel. It includes free compilers, debuggers, editors, text processors, WWW servers, mail servers, SQL servers and a whole range of useful tools.

    Beowulf is a project that produced parallel Linux clusters with off-the-shelf hardware and freely available software:

    o Uses inexpensive Intel x86 based boxes (PCs)
    o One or more networking methods (10/100 Ethernet, Myrinet, ATM, etc.)
    o Fast network connectivity (hub, switch, gigabit switch or other)
    o Linux operating system (freely available Unix clone, includes full source code)
    o cc, f77, vi, emacs, perl, python and other free compilers and tools
    o Free PVM or MPI implementations (MPICH, LAM), etc.
    o Commercial HPF implementations are available

    A PC-cluster-style supercomputer, but much cheaper.

    Tools for Parallel Programming


    Introduction to Linux clusters

    The special-effects company Weta Digital used a cluster of 200 Linux servers to produce computer animation for the film The Lord of the Rings. Weta Digital collaborated on other projects such as Shrek and Titanic.

    The advantage of using Linux is its low cost and its open-source code, which makes it possible to adapt the system to specific technical needs.

    A cluster is a set of physically separate processors (not in the same cabinet) under the same operating system that work together on a single application.


    Processor farm (FARM) and a satellite image processed in parallel.

    Below are other applications where clusters running the Linux operating system were used.

    Plot 3-D simulation of space shuttle streamlining


    Airflow design for performance and fuel efficiency
