Departamento de Informática de Sistemas y Computadores

Concurrent Error Detection in the Execution Flow of a Processor

Doctoral Thesis

Francisco Rodríguez Ballester

Supervised by:

Dr. Juan José Serrano Martín

September 2015

To Susana, for her love and patience

To my father, who would have liked to see this work finished
And to my mother, who will smile when she sees it

Acknowledgements

I want to thank my family, and especially my wife Susana, for the love, support and understanding they have always given me. Feeling that unconditional backing is a blessing when facing a work of this caliber.

I also want to thank, for their support and solidarity, all the people at the Departamento de Informática de Sistemas y Computadores of the Universitat Politècnica de València who have helped and encouraged me during the development of this thesis. Many of them have, in one way or another, contributed their grain of sand so that I could finish it. I do not want to commit the horrible sin of leaving anyone out of what would be a long list, so I will not name you individually; I hope you can forgive me this venial sin.

I cannot, however, fail to single out Toni Martí, colleague and companion in hardships, research, courses and other entanglements, always ready to lend a hand.

And, of course, my most sincere gratitude goes to the exceptional person that is Juan José Serrano: colleague, mentor and friend, as well as supervisor of this thesis. Without him I would not have started. And without his continuous support and his help at every moment I needed it, this work would definitely not have seen the light of day.

Summary (Spanish version)

The incorporation of error detection mechanisms is a fundamental element in the design of fault-tolerant systems. In many cases the detection of an error, whether transient or permanent, is the starting point that triggers a series of actions or the activation of elements pursuing one of these objectives: continuing system operation despite the error, recovering from it, stopping operation and bringing the system to a safe state, and so on. These objectives ultimately aim to improve the reliability, safety and availability of the system in question, among other attributes.

One of these error detection elements is the watchdog processor; its job is to monitor the system processor and check that no errors occur during program execution.

The main drawback of existing proposals in this area, and the one that prevents their wider adoption, is the loss of performance and the increase in memory consumption suffered by the monitored system. The increase in memory consumption is due to the addition to the original program of data (called signatures) containing the information needed to detect errors. The loss of performance stems from the fact that, in general, it is the system processor itself that performs the operations needed to detect possible errors during execution (or, at least, that fetches the signatures).

This work proposes a new signature embedding technique, called ISIS (Interleaved Signature Instruction Stream), in which the signatures are interleaved within the program memory space. With this technique, an element separate from the system processor (a watchdog processor proper) performs the operations aimed at detecting errors. Although the signatures are mixed with the instructions of the running program, and unlike previous proposals, the main system processor is involved neither in fetching the signatures nor in the corresponding computations, which reduces the loss of performance.

A novel technique is also proposed that lets the watchdog processor verify the structural integrity of the monitored program by checking the jump addresses used. This jump address processing technique largely solves the problem of checking a jump to a new area of the program when there are multiple valid destinations. This problem had no adequate solution until now, and although the proposal made here does not resolve every possible jump scenario, it does bring a good number of them into the set of verifiable jumps.

The theoretical ISIS proposal and its error detection mechanisms are complemented by a complete system (processor, watchdog processor, cache memory, etc.) based on ISIS and incorporating the detection mechanisms proposed here. This system, named HORUS, is developed in synthesizable VHDL, so that it is possible not only to simulate the behavior of the system when a fault appears and analyze its subsequent evolution, but also to program a programmable logic device such as an FPGA for inclusion in a real system.

To program the HORUS system, a modified version of the gcc compiler has been developed in this work. It includes the generation of the reference signatures for the watchdog processor as an integral part of the process of building the executable program (compilation, assembly and linking) from source code written in C.

Finally, this thesis also presents FIASCO (Fault Injection Aid Software COmponents), a set of Tcl/Tk scripts that inject a fault during the simulation of HORUS in order to study its behavior and its ability to detect the resulting errors. With FIASCO it is possible to launch hundreds or thousands of simulations in a distributed environment, reducing the time needed to obtain the data from large-scale injection campaigns.

The results show that a system using the techniques proposed here is able to detect errors during program execution with a minimal loss of performance, and that the memory consumption penalty of using a watchdog processor is similar to that of previous proposals.

Summary (Valencian version)

Incorporating error detection mechanisms is a fundamental element in the design of fault-tolerant systems. In these systems the detection of an error, whether transient or permanent, often marks the start of a series of actions or the activation of elements intended to achieve one of the following objectives: keeping the system operating despite the error, recovering the system, stopping operation and placing the system in a safe state, and so on. These objectives are essentially meant to improve the reliability, safety and availability of the system.

The watchdog processor is one of the elements used for error detection. Its job is to monitor the system processor and check that no errors occur during the execution of instructions.

The main drawbacks of using watchdog processors are the loss of performance and the increased memory requirements of the monitored system, which is why their use is not very widespread. The increase in memory consumption is a consequence of adding to the original program some data, called signatures, that contain the information needed to detect errors. The loss of performance is due to the fact that the processor itself has to perform the operations needed to detect errors during the execution of instructions.

This work proposes a new signature embedding technique, called ISIS (Interleaved Signature Instruction Stream), which interleaves the signatures in the program memory space. In this way an element external to the processor can perform the operations aimed at detecting errors, while the processor executes the original program without having to process the signatures, even though they are mixed with the instructions of the running program.

This work also proposes a new technique that lets the watchdog processor verify the structural integrity of the running program. This verification addresses the problem of checking that, when the processor executes a jump to a new area of the program, the jump goes to one of the valid destinations. Until now there was no adequate solution to this problem, and although the technique presented does not resolve every possible case, it does add a good number of jumps to the set of verifiable jumps.

The techniques presented are reinforced by a complete system (processor, watchdog processor, cache memory, etc.) based on ISIS and incorporating the detection mechanisms proposed in this work. This system has been given the name HORUS and is developed in synthesizable VHDL, which makes it possible not only to simulate its behavior when an error appears and analyze its evolution, but also to program it into an FPGA device for inclusion in a real system.

To program the HORUS system, a modified version of the gcc compiler has been developed. This version of the compiler includes the generation of the reference signatures for the watchdog processor as part of the process of building the executable program (compilation, assembly and linking) from C source code.

Finally, this thesis has also produced FIASCO (Fault Injection Aid Software COmponents), a set of Tcl/Tk scripts that inject faults during the simulation of HORUS in order to study its ability to detect errors and its subsequent behavior. With FIASCO it is possible to launch hundreds or thousands of simulations in distributed environments, reducing the time needed to obtain the data of a large fault injection campaign.

The results obtained show that a system using the techniques described is able to detect errors during program execution with a minimal loss of performance, and with memory requirements similar to those of previous proposals.

Abstract

Incorporating error detection mechanisms is a key element in the design of fault-tolerant systems. For many of those systems the detection of an error (whether transient or permanent) triggers a series of actions or the activation of elements pursuing any of these objectives: continuation of system operation despite the error, system recovery, stopping the system in a safe state, etc. These objectives are ultimately intended to improve the reliability, security, and availability of the system in question, among other characteristics.

One of these error detection elements is a watchdog processor; it is responsible for monitoring the system processor and checking that no errors occur during program execution.

The main drawback of the existing proposals in this regard, and the one that prevents their more widespread use, is the loss of performance and the increased memory consumption suffered by the monitored system. The increase in memory consumption is due to the addition to the original program of some data (called signatures) containing the information required to enable the detection of errors. The performance loss comes from the fact that it is generally the system processor itself that performs the operations (or, at least, fetches the signatures) needed by the error detection mechanisms.

In this PhD thesis a new technique to embed signatures is proposed. The technique is called ISIS (Interleaved Signature Instruction Stream), and it embeds the watchdog signatures interspersed with the original program instructions in memory. With this technique, an element separate from the system processor (a watchdog processor as such) carries out the operations to detect errors. Although signatures are mixed with program instructions, and unlike previous proposals, the main system processor is involved neither in fetching these signatures from memory nor in the corresponding calculations, which reduces the performance loss.

A novel technique is also proposed that enables the watchdog processor to verify the structural integrity of the monitored program by checking the jump addresses used. This jump address processing technique largely solves the problem of verifying a jump to a new program area when there are multiple valid destinations for the jump. This problem has not had an adequate solution so far, and although the proposal made here cannot solve every possible jump scenario, it brings a large number of them into the set of verifiable jumps.

The theoretical ISIS proposal and its error detection mechanisms are complemented by the contribution of a complete system (processor, watchdog processor, cache memory, etc.) based on ISIS, which incorporates the detection mechanisms proposed here. This system has been called HORUS and is developed in the synthesizable subset of the VHDL language, so it is possible not only to simulate the behavior of the system when a fault occurs and analyze its subsequent evolution, but also to program a programmable logic device such as an FPGA for its inclusion in a real system.

To program the HORUS system, a modified version of the gcc compiler has been developed in this PhD thesis. It includes the generation of signatures for the watchdog processor as an integral part of the process of creating the executable program (compilation, assembly, and linking) from source code written in the C language.

Finally, another contribution of this PhD thesis is FIASCO (Fault Injection Aid Software COmponents), a set of Tcl/Tk scripts that allow the injection of a fault during the simulation of HORUS in order to study its behavior and its ability to detect subsequent errors. With FIASCO it is possible to perform hundreds or thousands of simulations in a distributed environment to reduce the time required to collect the data from large-scale injection campaigns.

Results show that a system using the techniques proposed here is able to detect errors during the execution of a program with a minimal loss of performance, and that the penalty in memory consumption when using a watchdog processor is similar to that of previous proposals.

Thesis by compendium of publications

A thesis is an unpublished creative work, a piece of rigorous research, a work of scientific production, and the initial framework of a researcher's specialty [1]. It constitutes a source of information that reflects achievement in its own field of knowledge, and it is directly related to the search for and transmission of knowledge through documented information, where the collection and analysis of data are of great importance as the origin of scientific production.

At the current pace of knowledge transmission, universities and high-level professional and research institutions have accepted the role of training future scientists and awarding the degree of Doctor to those who prove capable of carrying out high-quality research (Nascimento, 2000), and whose scientific production habitually involves publishing the results that emerge during the development of the doctoral thesis, making up an original piece of research that is not always entirely unpublished.

On this basis, this doctoral thesis is presented as a document structured as a compendium of previously published, mutually related articles. The quality of the publications is attested by the prestige of the venues in which they appeared.

In this particular respect, the presentation of this document seeks to comply with the regulations established by the Programa de Doctorado en Arquitectura y Tecnología de los Sistemas Informáticos of the Departamento de Informática de Sistemas y Computadores of the Universitat Politècnica de València, in accordance with its lines of research, through the appropriate procedure provided within its functional organization and the approval of the Thesis Project proposal by the supervisor and the Doctoral Committee of this University.

Contents

Acknowledgements
Summary (Spanish version)
Summary (Valencian version)
Abstract
Thesis by compendium of publications

I Introduction

1. Introduction and objectives of the thesis
1.1. Background
1.2. Motivation
1.3. Objectives
1.4. Organization of this document

2. ISIS: A signature embedding proposal
2.1. Introduction
2.2. Analysis of existing proposals
2.3. Proposal
2.3.1. Description of the reference signature
2.3.2. Error detection mechanisms
2.3.3. Modifications to the processor and to programs
2.3.4. Handling of jumps
2.4. Software support
2.4.1. Internal structure of gcc
2.4.2. Most relevant internal elements of the binutils
2.4.3. Automatic insertion of reference signatures
2.4.4. Practical use of the compiler
2.5. Conclusions

3. HORUS: Implementation of the ISIS technique
3.1. Introduction
3.2. Test bench
3.2.1. Organization of the test bench
3.3. System architecture

II Publications

4. A Watchdog Processor Architecture with Minimal Performance Overhead
Francisco Rodríguez, José Carlos Campelo, Juan José Serrano

5. The HORUS Processor

6. Delivering Error Detection Capabilities into a Field Programmable Device: The HORUS Processor Case Study

7. A Memory Overhead Evaluation of the Interleaved Signature Instruction Stream

8. Improving the Interleaved Signature Instruction Stream Technique

9. Improving the Interleaved Signature Instruction Stream Technique

10. Control Flow Error Checking with ISIS
Francisco Rodríguez, Juan José Serrano

11. Reducing the VHDL-Based Fault Injection Simulation Time in a Distributed Environment

12. A Distributed Simulation Environment for Fault Injection Analysis on SoC Models

III Conclusions

13. Conclusions
13.1. Introduction
13.2. Contributions
13.3. Conclusions
13.4. Publications directly related to the thesis work
13.5. Future work

Part I: Introduction

Chapter 1. Introduction and objectives of the thesis

If you steal from one author it's plagiarism; if you steal from many it's research.

Wilson Mizner

1.1. Background

Industrial control systems, a term that here covers control systems for manufacturing machinery and for process automation in fields such as the automotive industry, railways, avionics, agriculture and medicine, have evolved constantly in recent years.

This evolution has been driven by several causes. Among them are the flexibility demanded, that is, the ability to adapt the system quickly to changes in the process being controlled; the speed and precision required; the integration of these systems into higher-level ones; increased performance; cost reduction; and increased competitiveness.

Alongside these needs come advances in technology: more complex programmable logic controllers, computers specifically designed for industrial environments, increasingly powerful microcontrollers, digital signal processors, and major advances in communication networks. On this last point, it is worth highlighting the appearance and standardization of networks developed specifically for this type of environment, known as fieldbuses or industrial local area networks. These have been one of the most important factors in favoring the development of distributed industrial control systems.

These advances have made it possible to abandon the classical conception of control systems, typically centralized and with a multitude of point-to-point connections to obtain process information and generate outputs, in favor of distributed systems in which the functions to be performed are divided among a set of devices or nodes interconnected by a communication network.

This development has enabled the use of these systems in a multitude of applications, from the simplest to the most complex and critical. As a result, it is increasingly common to require of these systems not only a certain level of performance but also specific levels of attributes such as reliability, safety and availability, among others. In other words, systems are required to be dependable. It is at this point that the field of fault-tolerant systems becomes vitally important.

A system is said to be fault tolerant when it is able to continue its work even if permanent, transient or intermittent errors manifest themselves (errors for which tolerance mechanisms were provided at the system design stage). A system is said to behave safely, on the other hand, if, once a failure occurs and correct operation cannot continue, it stops in a known state that poses no risk to the process it controls.

The reliability of computer systems in general, and of industrial systems in particular, has been and remains one of the main objectives in their design. The introduction of computers into critical applications led to the definitive advance of fault-tolerant systems. Improvements in reliability, safety, availability and confidentiality, among other dependability attributes, have become an increasingly important requirement in the design of these systems.

In the industrial environment, the use of distributed systems to control a process or application that may entail a certain risk, either to the process being automated or to the users of the equipment incorporating them, makes it necessary to think about the dependability these systems will be able to offer. What will the reliability of these systems be? How can errors be noticed so that they can be corrected, or so that the system at least responds safely?

Mechanisms must therefore be incorporated to detect the errors a system may suffer. These errors may be permanent, when some component of the system fails, or intermittent or transient. Systems that run at ever higher frequencies and use ever lower voltage levels (and therefore have a smaller difference between logic values) are increasingly susceptible to transient faults of this kind. At the same time, the use of many of these systems in noisy environments (industrial, automotive, transport, among others) makes it necessary to incorporate, as mentioned above, mechanisms that keep systems working correctly in the presence of various errors.

The main objective, and also the main selection criterion, in the design of fault-tolerant systems is to achieve high dependability. However, other important variables must be taken into account, and they sometimes determine the choice of one mechanism over others:

- The loss of performance that the inclusion of fault tolerance mechanisms in a system always entails. Because of this, the joint study of performance and dependability (performability) is currently used to compare different fault tolerance systems.

- The increased cost of a fault-tolerant system compared with a system of similar functionality that does not tolerate faults; this cost is proportional to the amount of redundancy added.

- The response speed of the mechanisms employed, or how easy or difficult they are to design.

If achieving high dependability is the priority when designing a fault-tolerant system, one of the main problems to solve is precisely its validation. Validation based on evaluating theoretical models of the system (based on Markov chains, Petri nets or similar) in the early stages of development does not yield exact knowledge of its dependability, since it leaves two major unknowns unresolved:

- The coverage factors or coefficients for the detection or recovery of errors, which measure the effectiveness of the fault tolerance mechanisms introduced.

- The latency times for the detection or recovery of errors, which measure the speed of the fault tolerance mechanisms introduced.

These unknowns are precisely the ones that are fundamental when making decisions about a fault-tolerant system. Their accurate determination therefore relies on experimental methods. Moreover, mainly because of the low probability of faults occurring in these systems, the experiments cannot rely on direct observation of a fault-tolerant system under normal conditions. In short, the experiments must be based on the deliberate and controlled introduction (injection) of faults into the system in order to observe its response to them.
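As a small worked example of the two unknowns discussed in this section, the following Python sketch estimates a detection coverage and a mean detection latency from the outcomes of a fault injection campaign. The record layout (`InjectionResult`), the function names and the sample values are assumptions made purely for illustration; they are not taken from the thesis or from any particular tool. Coverage is taken here as detected errors divided by activated faults, which is one common convention among several.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class InjectionResult:
    """Outcome of one fault injection experiment (hypothetical record layout)."""
    activated: bool                     # did the injected fault produce an error?
    detection_latency: Optional[float]  # time from error to detection; None if undetected


def detection_coverage(results: List[InjectionResult]) -> float:
    """Fraction of activated faults whose errors were detected."""
    activated = [r for r in results if r.activated]
    if not activated:
        raise ValueError("no activated faults: coverage is undefined")
    detected = [r for r in activated if r.detection_latency is not None]
    return len(detected) / len(activated)


def mean_detection_latency(results: List[InjectionResult]) -> float:
    """Average latency over the errors that were actually detected."""
    latencies = [r.detection_latency for r in results
                 if r.activated and r.detection_latency is not None]
    if not latencies:
        raise ValueError("no detected errors: latency is undefined")
    return mean(latencies)


# Made-up campaign with four injections.
campaign = [
    InjectionResult(activated=True, detection_latency=12.0),
    InjectionResult(activated=True, detection_latency=30.0),
    InjectionResult(activated=True, detection_latency=None),   # error escaped detection
    InjectionResult(activated=False, detection_latency=None),  # fault never became an error
]

coverage = detection_coverage(campaign)
latency = mean_detection_latency(campaign)
print(f"coverage = {coverage:.3f}, mean latency = {latency:.1f}")
```

Faults that never become errors are excluded from the denominator in this sketch; a campaign that counts them differently would report a different coverage for the same raw data, so the convention should always be stated alongside the figure.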

1.2. Motivation

In general, and from the point of view of distributed systems, dependability can be improved by applying improvements at three different levels:

1. At the level of the nodes of the distributed system, fault-tolerant architectures can be developed. The aim is for the node itself to recover from the failure and continue working; if this is not possible, the node must at least be able to detect it.

2. At the system level, dependability can be improved by including cooperation strategies among the nodes to recover the system or to continue operating, even with reduced functionality or performance.

3. At the network level, using network layers with fault tolerance features will improve dependability, given that different physical media are available to carry information from one part of the system to another.

The work of this doctoral thesis focuses on the first of these levels; specifically, on the design of fault detection mechanisms in a single-processor system. The main objective is not to obtain fault masking mechanisms, which let the system continue operating despite the observed fault thanks to massive spatial or temporal redundancy, but to achieve a high detection rate that allows the cooperation mechanisms mentioned above to keep the overall system (the distributed system, as an entity) working.

This does not mean that other fault tolerance mechanisms applied to this same class of systems are useless or ineffective. Quite the contrary: the error detection mechanisms developed here can be the ideal complement to other kinds of solutions that are more focused on redundancy, whether spatial or temporal.

1.3. Objectives

Work on error detection in computer systems has been under way for a long time, and techniques have been created for the concurrent detection of errors in the execution flow of a processor, in order to give the system a certain level of integrity as far as instruction execution is concerned.

The problem of errors during instruction execution is so pressing (we are not talking here about errors in the calculations performed on data, a problem just as critical as the one addressed here) that purely software techniques have been developed to give general-purpose processors some ability to detect them.

The use of these software techniques imposes an enormous overhead on the processor, with the consequent loss of performance. With advances in technology and integration capacity, these techniques have been superseded by innovations proposed at the level of the circuit design of the processor or of the system in which it is integrated, which can be incorporated ever more simply and efficiently into the development of new products.

However, techniques with hardware support are mostly based on the development and use of new instructions specifically designed to help detect errors during execution.

Having dedicated circuitry to verify the operation of the system has made it possible to widen the range of detectable errors, covering not only the structural verification of the program (which is as far as software techniques go) but also the verification of each of the instructions that make up the program.

However, making the processor take part in the detection of its own execution faults by incorporating new instructions has led, in the proposals put forward, to some or all of the following drawbacks:

1. The processor to which the technique is applied is not a general-purpose processor but one with a very specific task. Its design does not allow the inclusion of mechanisms (interrupts, exceptions) that are absolutely necessary for any general-purpose system.

2. The system programmer must have very specific knowledge of error detection in order to insert the corresponding instructions correctly. This is much more serious than it may seem at first, since the specialization required of these programmers hinders the spread of error detection techniques. Moreover, adding the programmer to the chain of elements that must work together harmoniously to achieve the final objective (error detection) inherently brings the possibility that the system will suffer faults caused by human intervention in that chain. These could be considered a special case of so-called design errors.

3. The resulting processor can no longer execute binary code of the architecture from which it originally derives. This also creates significant resistance to the spread of fault detection techniques, since instead of allowing gradual, step-by-step adoption in existing systems it forces all the code already deployed to be revised.

At the same time, those proposals have left some or all of the following points unresolved:

1. The loss of performance and the increase in memory consumption when a hardware technique is used are of the same order of magnitude as those caused by software techniques. It is true that in this case they come with an increase in the number of detectable errors, but the cost is still too high.

2. The verification of the program's structural integrity (the verification of the execution flow) does not cover all the cases (control structures, program architecture) that can arise in generic software. This opens a window of uncertainty about the structural integrity of the program, more or less significant depending on how heavily the program under verification uses those unhandled cases.

This work aims to take a step towards resolving the drawbacks listed above and to reduce, or at least minimize, the impact of the points that previous proposals left open.

In particular, this thesis focuses on proposing a new formulation for inserting signatures into the execution flow of a general-purpose processor, one that reduces the loss of performance without diminishing the error detection capability (the coverage) or lengthening the time needed for detection (the so-called detection latency).

At the same time, the proposal will include new error detection mechanisms, thereby reducing the number of unhandled scenarios.

To check the feasibility of the proposal in practice, a processor based on an architecture that is widely known and used in the market will be designed to incorporate the proposed mechanisms. The practical implementation will demonstrate not only the viability of the proposal but also that it can be applied to a general-purpose system in which tasks carrying the information needed by the error detection mechanisms coexist with standard binary code for the processor's base architecture (the oft-mentioned compatibility with previous systems).

Furthermore, instrumenting the tasks as mentioned above must not require any specific knowledge from the programmer beyond passing one more parameter when compiling the source code; all the work will be done automatically by the compiler and the tools it relies on.


Specifically, the objectives of this thesis are the following:

Put forward a new proposal that takes into account the limiting factors of previous proposals, so as to include mechanisms that improve on them.

Identify a well-known architecture on which to put the above proposal into practice. Define the architecture of the resulting system.

Demonstrate the feasibility of that architecture by developing a model in a hardware description language that can later be synthesized onto a programmable logic device. Having a synthesizable model is what makes it possible to guarantee that the architecture is viable.

Modify the software development tools for that architecture so that they allow the implemented error detection mechanisms to be incorporated as transparently as possible for the programmer.

Demonstrate, by simulation or experimental testing, that the detection mechanisms work. The aim is not to characterize them, which would require an enormous amount of additional work, although that characterization will of course be left as one of the open lines that naturally continue this work. What is to be demonstrated here is that the error detection mechanisms added to the original processor are not merely cosmetic, but are in a position to carry out their task; that is, they allow faults in the processor's execution to be detected concurrently with the execution itself.

Quantifying the fault detection characteristics of the system once these mechanisms have been added is a different matter. Such a characterization would include, among others, the following parameters: i) the percentage of faults that, although they occur in the system, do not result in a failure or malfunction; ii) the detection coverage, that is, the percentage of faults that produce an error in the system and are actually detected (since no detection mechanism is perfect, some errors will go undetected); iii) the impact on system failures of adding new circuitry that was in principle designed for error detection; and so on.


Quantify the penalty incurred by providing the system with these mechanisms: the values to obtain here are the increase in memory consumption and the degradation in performance. This objective requires an analysis of the causes of the penalty, so that the experiments are clearly aimed at putting the system under the greatest possible stress.
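To make parameters i) and ii) of the characterization mentioned above concrete, the following minimal sketch computes them from the outcome of a fault injection campaign. The outcome labels and the function name are assumptions introduced here for illustration; they are not taken from the tools described in this thesis.

```python
from collections import Counter


def campaign_metrics(outcomes):
    """Summarize a fault injection campaign.

    `outcomes` holds one label per injected fault (hypothetical labels):
      "no_effect"  - the fault never caused a failure
      "detected"   - the fault produced an error and it was detected
      "undetected" - the fault produced an error that went unnoticed
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    # Only faults that produced an error count towards detection coverage.
    effective = counts["detected"] + counts["undetected"]
    return {
        "non_effective_pct": 100.0 * counts["no_effect"] / total if total else 0.0,
        "coverage_pct": 100.0 * counts["detected"] / effective if effective else 0.0,
    }


# Made-up campaign: 100 injections, half of them without any effect.
results = ["no_effect"] * 50 + ["detected"] * 45 + ["undetected"] * 5
print(campaign_metrics(results))
```

Note that coverage is computed only over the faults that produced an error, which is why the non-effective faults are excluded from its denominator.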

1.4. Organization of this document

This document is organized in three parts: a first part, written specifically for this document, which introduces the fundamental ideas and concepts of the work; a second part that incorporates the publications associated with this doctoral thesis as individual chapters adapted to the format of this document; and a final part, also previously unpublished, which presents the conclusions and outlines possible directions for continuing the work presented here.

This is the structure specified for a doctoral thesis by compendium of publications in the Normativa de los Estudios de Doctorado of the Universitat Politècnica de València, approved by the Consejo de Gobierno at its session of 15 December 2011 (published in the Butlletí Oficial de la Universitat Politècnica de València no. 54), amended by agreement of the Comisión de Doctorado on 9 April 2013 and approved by the Consejo de Gobierno on 25 April 2013.

This chapter (Chapter 1) presents the context of the work of this thesis and sets out the objectives pursued.

Chapter 2 describes the new proposal for automatically generating and inserting signatures throughout the executable code of an application. It makes it possible to incorporate a concurrent error detection mechanism (more specifically, a watchdog processor) without the user (here, the system programmer) needing any specific knowledge. The proposal has been named after an ancient Egyptian deity, ISIS, which "coincidentally" matches the acronym of Interleaved Signature Instruction Stream.

Chapter 3 develops the theoretical proposal of the previous chapter. It proposes the minimum modifications needed in a RISC processor based on the MIPS architecture to implement in practice what ISIS puts forward, together with the modifications to the software development system: tools based on the gcc compiler, which has a large user base both in general-purpose systems and in the specific environments of embedded systems, and which supports the MIPS processor architecture.

Not only have the ISIS proposals been incorporated into this processor; the resulting processor (named HORUS) has also been given the ability to switch, at run time, between tasks that incorporate the ISIS techniques and tasks that are binary compatible with the original MIPS architecture.

This processor has been developed in the hardware description language VHDL, so that the simulation model can be taken to a real programmable logic device after the corresponding (automatic) synthesis process.

Besides the processor, this chapter describes the set of surrounding elements that have also been developed in order to have a complete system on which to carry out later experiments. These elements are:

The watchdog processor that monitors and verifies the operation of the main processor. It bases this verification on values inserted into the program at compile/link time (values known as signatures).

A cache memory with two access ports, so that the main processor can read an instruction and the watchdog processor can read a signature simultaneously.

The set of elements needed to build the high-performance multi-master AMBA-AHB bus, following the specification developed by ARM. This multi-master bus allows us to connect the processor and its instruction cache to a set of peripherals and, more importantly from the point of view of this work, to the system memory.

The interface to the outside of the programmable device, which provides the ROM and static RAM that make up the system memory where instructions and data are stored.


To carry out the experiments on the VHDL model of the system that demonstrate the functionality of the fault detection mechanisms, a specific tool named FIASCO (an acronym of Fault-Injection Aid Software Components) has been developed. With this tool (more accurately, a set of scripts) faults have been injected into the VHDL model of the system in order to verify that a fault, once it has led to a failure, is detected by the watchdog processor.

FIASCO is closely tied to the VHDL simulator used throughout this thesis, ModelSim, from Mentor Graphics. With the FIASCO+ModelSim combination it is possible to design fault injection campaigns on the model of the HORUS system and then dispatch hundreds or thousands of simulations to a set of machines in a distributed environment. The results are collected automatically and processed to determine whether a failure has occurred, whether the watchdog processor has detected the fault, and so on.

Data have also been obtained to evaluate the performance of the main processor:

On the one hand, analyzing the binary code that results from inserting the signatures and comparing it with the original binary code makes it possible to determine the increase in memory usage.

On the other hand, simulating the system under a given workload makes it possible to determine the relative loss of performance caused by including the watchdog processor in the system. This is done by comparing the execution time when the workload (the program) carries signatures for the watchdog processor with the original case, in which there are no signatures and the watchdog processor has been disabled (to prevent it from interfering with the operation of the instruction cache).
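Both measurements reduce to the same relative comparison between a baseline and an instrumented value. As a minimal sketch (the function name and the figures are illustrative assumptions, not results from this work):

```python
def relative_increase_pct(baseline, instrumented):
    """Relative increase of `instrumented` over `baseline`, in percent.

    Works for both measurements: code size in memory words
    (memory overhead) or execution time in clock cycles (slowdown).
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (instrumented - baseline) / baseline


# Hypothetical figures: a 1000-word binary grows to 1125 words once
# the signatures have been inserted.
print(relative_increase_pct(1000, 1125))
```

The same function applied to cycle counts gives the relative performance loss, provided both runs use the same workload and the same memory system configuration.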

Finally, a summary of the contributions of this work, the publications it has given rise to, and a description of some of the main open lines of work are given in the conclusions of this document, in Chapter 13.


Chapter 2

ISIS: A signature embedding proposal

Yet, it is alarming to observe that the explosive growth of complexity, speed, and performance of single-chip processors has not been paralleled by the inclusion of more on-chip error detection and recovery features.

A. Avizienis [2]

2.1. Introduction

The "model for the future" that Avizienis describes in [2] makes clear the urgent need to incorporate fault tolerance mechanisms into the computing systems we use every day: "Yet, it is alarming to observe that the explosive growth of complexity, speed, and performance of single-chip processors has not been paralleled by the inclusion of more on-chip error detection and recovery features".

As a first step towards this, it is of fundamental importance to have efficient error detection mechanisms. Given that the vast majority of faults are transient in nature, concurrent error detection mechanisms are of the greatest interest because they combine high coverage with minimal detection latency, both of which are essential for allowing the system to recover from the resulting failure.

And, as experiments have shown [3, 4, 5, 6], a high percentage of the errors that are not overwritten end up producing an error in the processor's execution flow.

Regarding the adoption of fault tolerance techniques in the consumer market, Siewiorek is categorical in [7]: "To succeed in the commodity market, fault-tolerant techniques need to be sought which will be transparent to end users".

That is, a fault tolerance technique can only be considered transparent if applying it has a minimal impact on the system's performance, memory consumption or processing speed. Nor should we forget the human element here: if the technique requires highly specialized knowledge of fault tolerance, we can conclude, almost automatically, that it will not be used at all or, worse still, that it will be misused, creating false confidence in the resulting system.

We must also bear in mind that this work, although driven by performance and ease-of-use requirements, is aimed at the designer of embedded systems. This is our "end user", the human element referred to above, rather than the consumer who actually uses those systems.

These are the premises under which the proposal presented in this chapter is developed. It is a new technique for embedding the reference signatures to be used by a watchdog processor, with two guiding principles: first, the smallest possible impact on the performance of the resulting system and, second, the inclusion of as many error detection mechanisms as possible.

To shape this proposal, the next section gathers the most important conclusions drawn from a review of the state of the art and analyzes the weaknesses of the proposals available in the literature. From this analysis we derive the fundamental characteristics of the proposal, set out in section 2.3. Together with the signature embedding technique, a reference signature with a set of error detection mechanisms is proposed; these are described in section 2.3.2. Finally, section 2.4 describes the software support needed to generate the reference signatures automatically.

2.2. Analysis of existing proposals

Of the existing proposals, we exclude purely software ones from our analysis outright, because of their high performance cost and their practically nonexistent ability to detect errors caused by the alteration of the instructions to be executed.

Nor do we extend the analysis to proposals that require from the processor features which, although present in the computers we use every day, are still far from reaching the processors of the embedded systems this work deals with: the execution of multiple instructions per clock cycle in superscalar processors, or multithreading capability, to name some examples.

Focusing, then, on proposals with substantial hardware support on scalar processors, the main weaknesses are, in our opinion:

The lack of support for mechanisms that are crucial in general-purpose processors. For example, in the proposal for the TTP communications controller, the processor's ability to handle exceptions and interrupts is removed. It is true that, for the specific purpose of that communications controller, the task can perfectly well be done without these mechanisms. But it is equally true that a general-purpose processor without them is, today, inconceivable.

The lack of software support to automate the instrumentation of the programs that must be verified during execution. A good example is the Instruction Checker Module proposed in [8], in which a specific CHECK instruction must be placed immediately before the instruction to be verified. Although there is a vague reference to verification being applied to the critical parts of the software the processor must execute, no clear path is established that the programmer could use from a high-level language to achieve this. And when the experiments are run on a simulation model, the CHECK instructions are inserted on the fly when the execution of one of those critical instructions is detected. The justification given for this method, instead of adding the instructions to the original code, is precisely the lack of software support. The same justification is used in [9] to omit the results of its watchdog processor from the fault injection experiments. Although in isolated cases this justification is easy to understand and accept, it inevitably raises doubts about the viability of those proposals, doubts that their authors should dispel. In any case, a new proposal should not take the same path; it should establish unambiguously how the support software is to be used and how it generates the information needed to instrument the programs.

The inclusion of run-time signature computation and checking within the processor's execution units. This can potentially lower the maximum operating frequency, because new circuitry is added to the execution units of the very processor whose operation is to be monitored. If the error detection circuits fall on the critical path of the computation stages, the system frequency may have to be reduced to meet the new timing requirements. It must be made clear that this penalty is only potential, and that whether it is real can only be determined after each implementation of the technique; the problem is that, if the penalty is then shown to exist, it is too late to correct it.

The inclusion of new instructions for the processor under verification, so that the processor itself performs, as an inherent part of executing an instrumented program, the run-time signature computation, the comparison with the reference signature, and so on. Under this heading we can include those proposals that leave run-time signature computation and checking to specialized modules separate from the processor but still require the processor to fetch the new instructions from memory and "execute" them, injecting them into the pipeline as part of the application program. Either way, the performance cost is evident in light of what [10] reports on the length of instruction blocks: between 7 and 8 instructions on average, according to the authors. Add a no-operation instruction to each of these blocks (no other instruction can have a smaller impact) and we get a rough idea of the performance loss incurred by these proposals.

The placement of the watchdog processor too far from the main processor. In some proposals the watchdog processor is connected to the memory access bus and uses snooping techniques to retrieve the signatures. However, it is not the main processor that is directly connected to this bus, but the controller of the first-level cache. Visibility of the execution flow of the program under verification is therefore lost; to compensate, the cache controller is modified so that it emits onto the bus the signatures the watchdog processor needs to observe. Even so, a significant loss of visibility remains, which prevents the watchdog processor from verifying whether the instructions executed by the main processor have been corrupted. It can then only verify the structure of the program, so the advantage of hardware support is lost.
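The rough estimate mentioned in the fourth weakness above (one extra instruction fetched and executed per block) can be worked out as follows. This is only a back-of-the-envelope sketch under two simplifying assumptions that are not taken from [10]: every block of $n$ instructions gains exactly one extra instruction, and all instructions cost the same.

\[
\text{overhead}(n) = \frac{1}{n}, \qquad
\text{overhead}(7) = \frac{1}{7} \approx 14.3\,\%, \qquad
\text{overhead}(8) = \frac{1}{8} = 12.5\,\%
\]

Under these assumptions the number of executed instructions grows by roughly 12 to 14 percent, before any other cost of the technique is taken into account.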

Apart from these weaknesses observed in the proposals found in the literature, which should be avoided as far as possible, there are some important aspects that must not be overlooked when devising a new approach.

One of these aspects is where the signatures are placed in memory. There are two approaches. The first separates the signatures completely from the program to be verified, using separate memories or, at least, distinct regions of the system memory. The second interleaves the watchdog processor's reference signatures with the instructions of the program to be verified.

In the first case, the aim is to avoid the interference the signatures would cause in the execution of the program if they were interleaved with its instructions. This creates a new problem: how is a signature associated with the block of instructions for which it serves as reference? In general, this is solved by having the watchdog processor execute a program with a minimal instruction set that allows it, at least, to mimic the structure of the main processor's program.

In the second case, the aim is precisely to eliminate the overhead of making the watchdog processor execute a program with the same structure as the main program. Indeed, if the signature somehow "accompanies" the block of instructions, the watchdog processor does not need to perform any jump when the main processor executes a branch instruction: wherever the main processor continues execution, the watchdog processor will find the reference signature associated with the new instructions.

A second aspect to consider is how to handle the main processor's jumps between blocks. In the vast majority of proposals, the only jumps handled are conditional and unconditional jumps with a single possible destination. That is, jump structures generated from a high-level language that can reach multiple possible destinations are not considered. One of the few exceptions to this simplification uses a list of possible destinations for a multi-way jump. Its own authors identify two practical problems: first, the list of destinations is finite, so not all jump scenarios are covered; second, the watchdog processor becomes necessarily complex, since it must perform multiple comparisons between the main processor's jump address and the entries of the list to determine whether the jump is correct, while remaining able to analyze the main processor's instructions at the rate at which they are executed.

Although in principle it might be acceptable to forbid this kind of jump in exchange for the fault tolerance characteristics of the system being designed, the reality is that such jumps are absolutely common and necessary in any programming language.

Consider that when a program finishes executing a subroutine or function and returns to the calling code, that return is, from the watchdog processor's point of view, a jump, a break in sequential execution. Since one of the reasons for using procedures is code reuse, it is impossible to assign a single destination to the return; that would amount to forcing a procedure to be callable from a single call site only.

To allow the use of functions, some works (see, for example, the work described in [6]) propose instrumenting the programs to be monitored with additional instructions that manipulate the system stack, thereby saving the information the watchdog processor needs to verify that the return is correct.

Other works, such as the one described in [11], propose using justifying signatures for jumps with multiple destinations and delaying verification until the program, whichever of the multiple paths it takes, reaches a common point. The justifying signatures interleaved along the way to that common point make the run-time signature the same regardless of the path chosen by the main processor. Delaying verification until the common point has, as its first drawback, a notable increase in detection latency; coverage also suffers, since the sequence of instructions can become quite long. Finally, it is hard to see how this technique could be applied to procedure returns; it is better suited to high-level control flow structures that fit a multi-way branch followed by a relatively early merge at a common point, such as the switch multi-way conditional statement of the C language.

2.3. Proposal

To allow the instructions executed by the main processor to be verified (and thus detect so-called intra-block errors), the signatures are computed by feeding the binary encodings of those instructions as input data to an LFSR (Linear Feedback Shift Register).
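As an illustration of how an LFSR compresses a block of instructions into a signature, the following minimal sketch shifts each instruction word serially into a 16-bit register. The register width, the feedback polynomial (0x1021, the one used by CRC-16/CCITT) and the example instruction words are assumptions chosen for the example; this section does not specify the LFSR actually used.

```python
def lfsr_signature(words, width=16, poly=0x1021, word_bits=32):
    """Compress a sequence of instruction words into a `width`-bit signature.

    Each word is shifted into the register bit by bit, most significant
    bit first. `width` and `poly` are illustrative choices only.
    """
    mask = (1 << width) - 1
    state = 0
    for word in words:
        for bit_index in range(word_bits - 1, -1, -1):
            in_bit = (word >> bit_index) & 1
            # Feedback is the bit leaving the register XORed with the input bit.
            feedback = ((state >> (width - 1)) & 1) ^ in_bit
            state = (state << 1) & mask
            if feedback:
                state ^= poly
    return state


# Hypothetical three-instruction block (32-bit words).
block = [0x8FA40010, 0x24840001, 0xAFA40010]
reference = lfsr_signature(block)

# A single flipped bit in any instruction yields a different signature.
corrupted = [block[0], block[1] ^ 0x00000400, block[2]]
print(reference != lfsr_signature(corrupted))  # prints True
```

With this polynomial a single-bit error always changes the signature; multi-bit errors can alias to the same value with a probability that depends on the register width.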

In this proposal, the reference signatures are placed next to their associated sequential block. Unlike most previous proposals, however, the signature does not use the delay slot after the branch instruction that usually ends each block. Instead, the signature is placed at the beginning of the block.

Placing the signature ahead of its associated block has a twofold purpose. On the one hand, it lets us include a field with the exact length of the block, in instructions. This length allows the watchdog processor to detect all branch insertion and deletion errors without having to wait for the block to end (perhaps incorrectly) before performing this check.

On the other hand, this placement lets us minimize the loss of performance. To achieve this, it suffices that the main processor be unaware of the existence of the signatures. This is done not by using a separate memory, as in some previous proposals, but by allowing two execution streams that are interleaved with each other yet completely separate. One is the original stream of the program under verification, the main processor's instructions. The other contains only the reference signatures with which the watchdog processor performs error detection.
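The use of the block length field can be sketched as follows. This is a simplified model, assuming that every block ends with a branch as its last instruction and ignoring delay slots; the function name and the status strings are hypothetical.

```python
def check_block_length(expected_length, observed):
    """Check a block against the length stored in its reference signature.

    `observed` yields one boolean per executed instruction, True when the
    instruction is a branch. Returns (status, position), where position is
    the 1-based instruction count at which the verdict was reached.
    """
    count = 0
    for is_branch in observed:
        count += 1
        at_end = count == expected_length
        if is_branch and not at_end:
            # A branch showed up before the block was supposed to end.
            return ("branch inserted", count)
        if at_end:
            # The last instruction of the block must be its branch.
            return ("ok", count) if is_branch else ("branch deleted", count)
    return ("incomplete", count)


print(check_block_length(4, [False, False, False, True]))   # ('ok', 4)
print(check_block_length(4, [False, True, False, True]))    # ('branch inserted', 2)
print(check_block_length(4, [False, False, False, False]))  # ('branch deleted', 4)
```

In both error cases the verdict is reached at the offending instruction, which is the latency advantage of knowing the block length in advance.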

This idea, the seed of the proposal of this work, gives its name to the signature embedding technique presented here: ISIS, an acronym of Interleaved Signature Instruction Stream.

Placing the signature ahead of its associated block of instructions is not enough, however, to achieve this separation. It only removes the problem when the processor takes a jump, because the jump lands on the first instruction of the block and simply bypasses the preceding signature. To make the two streams fully independent, we still need to determine how the main processor can "dodge" the reference signature of the next block when, for example, the condition of a conditional branch is not met and the program continues sequentially. In that case, the reference signature of the next block sits between the block ending with the conditional branch and the next block. If the main processor were allowed simply to continue sequentially, the first "instruction" of the next block would not be an instruction at all, but its reference signature.

How can this be avoided? With another novel idea: modifying the semantics of the processor's conditional branch instructions. The encoding of these instructions is not changed; what changes is how the processor behaves when executing them.

To make the processor skip the signature of the next block, the meaning of the traditional conditional branch instruction is transformed from

\[
CP = \begin{cases}
\text{target} & \text{if the branch condition holds} \\
CP + 1 & \text{if the branch condition does not hold}
\end{cases}
\]

where CP is the program counter (Contador de Programa), which holds the memory address of the instruction being executed, and CP + 1 is the memory location following the current one; this traditional behaviour is transformed, as we were saying, into


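\[
CP = \begin{cases}
\text{target} & \text{if the branch condition holds} \\
CP + 2 & \text{if the branch condition does not hold}
\end{cases}
\]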

thereby creating a gap at address CP + 1, exactly where the signature of the next block is to be placed; the next block now begins at address CP + 2.

Note that the gap thus created, which is used to insert the reference signature of the next block, has no execution cost for the main processor, nor does the compilation process for high-level languages have to be modified to obtain it. For the main processor the gap simply does not exist, and skipping it does not count as a jump as such (that is, a jump resulting from the execution of an unconditional branch, or of a conditional branch whose condition is true).
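The effect of the modified branch semantics can be illustrated with a toy, word-addressed memory in which signatures precede their blocks. This is a sketch under simplifying assumptions: delay slots are ignored, every block ends with a branch, and the instruction kinds ("op", "beq", "j", "halt") and the memory layout are hypothetical.

```python
def next_pc(pc, taken, target, isis=True):
    """Next program counter after a conditional branch located at `pc`."""
    if taken:
        return target
    # With ISIS semantics the fall-through path skips the signature word.
    return pc + 2 if isis else pc + 1


def trace(memory, start, branch_taken, isis=True):
    """Return the addresses fetched by the main processor."""
    pc, fetched = start, []
    while True:
        kind, arg = memory[pc]
        fetched.append(pc)
        if kind == "sig":
            raise RuntimeError(f"signature fetched as instruction at {pc}")
        if kind == "halt":
            return fetched
        if kind == "beq":
            pc = next_pc(pc, branch_taken, arg, isis)
        elif kind == "j":
            pc = arg
        else:
            pc += 1


PROGRAM = [
    ("sig", "A"),    # 0: reference signature of block A
    ("op", None),    # 1: block A
    ("beq", 7),      # 2: conditional branch to block C
    ("sig", "B"),    # 3: reference signature of block B (the gap at CP + 1)
    ("op", None),    # 4: block B
    ("j", 7),        # 5: unconditional jump to block C
    ("sig", "C"),    # 6: reference signature of block C
    ("op", None),    # 7: block C
    ("halt", None),  # 8
]

print(trace(PROGRAM, 1, branch_taken=False))  # [1, 2, 4, 5, 7, 8]
print(trace(PROGRAM, 1, branch_taken=True))   # [1, 2, 7, 8]
```

In both runs the main processor never fetches addresses 0, 3 or 6, where the signatures live. Calling `trace` with `isis=False` and the branch not taken raises the error, since the traditional CP + 1 fall-through lands on the signature of block B.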

This semantic change does not yet solve every problem. For a complete description of how the two instruction streams have been separated, see section 2.3.3. It is worth stressing, however, that the ISIS technique requires no specific feature of the architecture to which it is applied, nor does the instruction set need to be modified.

The treatment of jumps with multiple destinations, including procedure returns, is also novel, and solves the problem of multiple checks in a simple and elegant way. When a jump has multiple destinations, all of them known when the signatures are generated, the verification that the jump was performed correctly is postponed until the destination block is reached. There, the block's reference signature contains the information the watchdog processor needs to verify that the origin of the jump is the correct one. In this way, the watchdog processor does not need to check against a more or less long list, but performs a single check. Each of the possible destinations holds the information that allows the origin of the jump to be verified, so the "list" of possible destinations is spread throughout the program. Moving the verification to the block that is reached solves the problem of jumps with multiple destinations, although not completely. For an exhaustive description of the possible scenarios, and of those that remain unresolved, see section 2.3.4.
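The difference between checking a destination list and checking at the destination can be sketched as follows. The exact information stored in the destination signature is not given in this section, so the `origin_tag` field, the addresses and the tag values below are purely hypothetical: the sketch assumes that all legal destinations of a given jump carry the same tag as the jump itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    origin_tag: int  # hypothetical field identifying the expected origin


def verify_with_list(target, allowed_targets):
    """List-based approach: up to len(allowed_targets) comparisons."""
    return any(target == allowed for allowed in allowed_targets)


def verify_at_destination(jump_tag, destination_signature):
    """Destination-based approach: a single comparison."""
    return jump_tag == destination_signature.origin_tag


# A function returns to one of two call sites (0x40, 0x80); both blocks
# carry the tag of the function's return jump. Block 0xC0 belongs elsewhere.
signatures = {
    0x40: Signature(origin_tag=7),
    0x80: Signature(origin_tag=7),
    0xC0: Signature(origin_tag=3),
}
RETURN_TAG = 7

print(verify_at_destination(RETURN_TAG, signatures[0x80]))  # True
print(verify_at_destination(RETURN_TAG, signatures[0xC0]))  # False
```

Whatever the number of call sites, the watchdog processor performs one comparison per jump, and no finite list has to be stored next to the jump itself.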


Figure 2.1: Encoding of the reference signature

2.3.1. Description of the reference signature

Building on the embedded-signature technique, a reference signature has been defined with a set of fields intended to maximize the number of error detection mechanisms available. The word that holds the reference signature of an instruction block contains the following fields (see figure 2.1 for a description of this word):

1. Block type. The encoding of the block type has been chosen so that the main processor cannot mistake it for an instruction. This encoding, although advisable, is not essential, nor is it always possible, since it depends on the number of opcodes available in the instruction set. When it can be done, the system gains an additional error detection mechanism for the jumps executed by the processor. This mechanism has been named block start, and its description can be found in section 2.3.2.

Obviously, if the encoding of the signature type cannot be fully distinguished from the encoding of the main processor's instructions, this error detection mechanism cannot be applied.

In any case, the information in this field classifies the instruction block of the main processor and, depending on that classification, determines which verification of the jump is performed.

On the one hand, blocks are classified according to the type of jump that ends them:

The block ends with an unconditional jump. In this case, sequential execution is not allowed when the block finishes.

The block ends with a conditional jump. The verifications to perform are analogous to the previous case, except that the main processor is allowed to continue with sequential execution instead of jumping.

and, independently, according to the destinations:


The block ends with a simple jump. In this case, since there is a single destination, the reference signature itself carries the jump verification information in the destination address field.

The block ends with a multiple jump. Verification of the jump is postponed until the destination block is reached. Once there, the origin address field of the signature associated with the destination block is used for that verification.

The block ends with an uncovered jump. This type accommodates the multiple-jump scenarios that are not covered and remain to be solved (see section 2.3.4). Instead of forbidding those scenarios, they are allowed at the cost of opening a window of uncertainty about the jump the processor performs when one of these blocks finishes.

Finally, in another classification independent of the previous ones, the origin block from which the current block is reached is characterized:

If the block is reached from one or more simple-jump blocks, the jump verification provided by the origin address field must not be performed.

If the block is reached from a multiple-jump block, the jump verification provided by the origin address field must be performed to check that the jump is correct.

2. Destination address. The bits of this field make it possible to verify that the jump performed by the main processor is correct. The difference between the address of the jump instruction that ends the current block and the address of the jump destination (for a conditional jump, the destination taken when the condition is true) is compacted with a simple tree of xor logic gates, which allows the watchdog processor to verify that the jump the processor is about to execute to finish the current block is correct. The computation of the bits of this field, and of the next one (origin address), which is performed in exactly the same way, is described in section 2.3.2.

3. Origin address. The bits of this field make it possible to verify that the jump performed by the main processor is correct. Although the bits of this field are computed in the same way as those of the previous field (destination address), in this case the current block is one of the possible destinations reached from an origin block with multiple destinations. The verification that the watchdog processor performs with this field therefore takes place after the jump.

4. Length. The watchdog processor verifies, when the last instruction of the block is retired from the pipeline, that the number of instructions executed since the start of the block is correct. The watchdog processor recognizes the last instruction of a block by observing that the main processor has performed a jump, that is, a break in sequence.

5. Derived signature. This field stores the result of an LFSR register fed with the binary sequence of the block's instructions, which constitutes the signature assigned to the block.
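The derived signature of item 5 can be illustrated with a small sketch. The text does not give the LFSR polynomial, width or seed, so the 16-bit width, the tap mask `0x1021` and the zero seed below are assumptions chosen only for illustration, and the function name `lfsr_signature` is hypothetical.

```python
def lfsr_signature(instructions, width=16, taps=0x1021, seed=0):
    """Compact 32-bit instruction words into a `width`-bit signature.

    Serial LFSR sketch: width, taps and seed are illustrative assumptions,
    not the values used by ISIS.
    """
    mask = (1 << width) - 1
    state = seed & mask
    for word in instructions:
        # Feed the instruction word bit by bit, most significant bit first.
        for bit_index in range(31, -1, -1):
            in_bit = (word >> bit_index) & 1
            msb = (state >> (width - 1)) & 1
            state = (state << 1) & mask
            if msb ^ in_bit:
                state ^= taps
    return state


# Arbitrary 32-bit words standing in for the instructions of one block.
block = [0x8FA40000, 0x00851020, 0x1000FFFF]
reference = lfsr_signature(block)
```

With such a register, the signature recomputed at run time from the retired instructions is compared against the stored reference; any single flipped bit in the block changes the result.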

Note that the size of these fields is indicative, and that experiments must be carried out to characterize the error detection mechanisms and determine whether the number of bits assigned to each field is adequate.

In particular, the block type field has been given the same size as the opcode of the instructions of the MIPS architecture on which the practical development is carried out (see chapter 3 in this respect). For the same reason, the total size of the reference signature is 32 bits, so that it matches the size of the instructions of that architecture.
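As a rough illustration of how the five fields could share one 32-bit word, the sketch below packs and unpacks a signature. Only the 32-bit total and the 6-bit block type (the width of the MIPS opcode field) follow from the text; the remaining widths (3, 3, 4 and 16 bits), the field order and all names are assumptions made for this example and do not reproduce figure 2.1.

```python
from collections import namedtuple

# Hypothetical layout, most significant field first. Only the 6-bit type and
# the 32-bit total come from the text; every other width is an assumption.
FIELDS = (("block_type", 6), ("dest_addr", 3), ("origin_addr", 3),
          ("length", 4), ("derived_sig", 16))
Signature = namedtuple("Signature", [name for name, _ in FIELDS])


def pack(sig):
    """Pack a Signature into a single 32-bit word."""
    word = 0
    for (name, width), value in zip(FIELDS, sig):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split a 32-bit word back into its fields."""
    values = []
    shift = 32
    for _, width in FIELDS:
        shift -= width
        values.append((word >> shift) & ((1 << width) - 1))
    return Signature(*values)


example = Signature(block_type=0x3F, dest_addr=5, origin_addr=2,
                    length=9, derived_sig=0xBEEF)
word = pack(example)
```

Placing the block type in the top six bits mirrors the position of the opcode in a MIPS instruction word, which is what would let an unused opcode value mark the word as a signature.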

2.3.2. Error detection mechanisms

With the fields of the reference signature, the error detection mechanisms built into the watchdog processor are:

1. Block start. Although it is not always possible, it is highly advisable to encode the reference signature types so that they are completely distinct from the instructions of the main processor. This encoding allows the watchdog processor to incorporate an error detection mechanism for the jumps executed by the main processor, which has been named block start. The mechanism works in a very simple way: if the processor jumps to a wrong address and reaches an instruction that is not the start of an instruction block, the memory location immediately preceding it will not contain a reference signature for the watchdog processor. Since the watchdog processor uses the addresses of the main processor as the basis for fetching the reference signature of the block being executed, and thanks to the fully distinct encoding, the watchdog processor can easily detect the error when it finds a signature type that matches none of the known types. Similarly, the main processor itself can detect the error directly and raise an illegal instruction exception if the wrong jump lands on a reference signature.

2. Block type. Although the block type, by itself, is not a field that allows any kind of verification of the integrity of the execution flow (except for the indication of conditional or unconditional jump), the sequence of types in the reference signatures of consecutive blocks does make it possible to determine whether the structure of the program executed by the processor is correct, since not all combinations are acceptable. For example, if the signature of a block indicates that the verification of the jump must be postponed until the processor reaches the destination block, the signature associated with that destination block must necessarily indicate that the block is reached from a jump with multiple destinations and that the origin must be verified. Likewise, if a block ends with a jump to a single destination, the destination block cannot indicate that the origin block of the jump must be verified.

3. Block address. Whether verifying the block that must be reached when the current one finishes (using the destination address field) or the origin block from which the current one has been reached (using the origin address field), the jumps of the main processor are verified in the same way. The reference value is computed from the difference between the origin and destination instructions of the jump, just like the value computed by the watchdog processor at run time. In both cases this value is compacted with xor logic gates to reduce its length (in bits) until it fits the length of the corresponding field.

4. Block length. This is one of the simplest mechanisms. It is a plain instruction counter that is decremented with each instruction executed by the main processor. If the counter reaches zero and no break in sequence occurs, a jump deletion error has taken place. If, on the contrary, the break in sequence occurs before the counter reaches zero, a jump insertion error has taken place. Note that, for this check, conditional jumps whose condition is not met (and which are therefore not taken) are processed exactly like unconditional jumps. Given the structure of the main processor's program, and thanks to the gap created in the sequence of addresses referenced by that processor in the case of a not-taken conditional jump, the watchdog processor observes a break in the instruction sequence regardless of the type of jump and of whether the jump is taken or not.

5. Instruction integrity. Using the derived signature as a reference, and applying the same LFSR computation to the instructions the processor actually executes, the watchdog processor can verify that the instructions have remained intact during their execution. To achieve this, the instructions must be analysed by the watchdog processor when they are retired from the pipeline, and not when they enter it, because during the different pipeline stages the instruction is stored in internal registers that are susceptible to errors.
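The block length mechanism of item 4 can be sketched as a counter over the stream of retired instructions. This is a minimal model under stated assumptions: the function name, the boolean encoding of "this instruction caused a break in sequence" and the returned strings are all hypothetical.

```python
def check_block_length(expected_length, sequence_breaks):
    """Classify one block from the per-instruction break-in-sequence flags.

    sequence_breaks[i] is True when retired instruction i was followed by a
    break in sequence (a jump, or the gap left by a not-taken conditional).
    """
    remaining = expected_length
    for broke in sequence_breaks:
        remaining -= 1                      # one instruction retired
        if broke:
            # A break is only legal on the last instruction of the block.
            return "ok" if remaining == 0 else "jump insertion"
        if remaining == 0:
            # Counter exhausted but execution continued sequentially.
            return "jump deletion"
    return "incomplete"                     # block has not finished yet


outcome = check_block_length(3, [False, False, True])
```

The "incomplete" result is an addition for the sketch only; it covers the case where the observed stream ends before the block does.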

Bear in mind that these are not the only error detection mechanisms in a system that implements the ISIS technique; they are only those added by the watchdog processor. Most processors already incorporate one or more checking mechanisms that force the error to be handled through an exception, among which the following can be cited:

Execution of an illegal instruction.

Instruction fetch from an address that is not aligned on a word boundary.

Instruction fetch from a page address that is not mapped in the MMU, or that lacks the appropriate access privileges.

Computation of the address field

As mentioned earlier, the reference bits for the verification of a jump are computed with a tree of xor logic gates, starting from the distance (in instructions) between the jump instruction and its destination.

This difference, which we will call V, is compacted by using its bits in an interleaved fashion to achieve maximum coverage, padding with zeros on the left if necessary. Algorithmically, each of the reference bits for the jump can be expressed as

for i = 0 to K − 1
    g_i = 0
    for j = 0 to ⌈L/K⌉
        g_i = g_i ⊕ V_{i+Kj}    if i + Kj < L
        g_i = g_i ⊕ 0           if i + Kj ≥ L
    end for
end for

where g_i is each of the reference bits, K is the total number of such bits (equal to the size of the corresponding field in the signature) and L is the length, in bits, of the difference between addresses, V.
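The algorithm above translates almost line by line into code. In the sketch below the function name `compact` is hypothetical, bit 0 of V is assumed to be the least significant bit, and a negative distance is assumed to be represented as an L-bit two's complement pattern; none of these conventions is stated in the text.

```python
def compact(v, k, length):
    """Fold the length-bit value v into k reference bits g_0 .. g_{k-1}.

    g_i is the xor of bits V_i, V_{i+k}, V_{i+2k}, ... below `length`;
    positions at or beyond `length` contribute 0, as in the pseudocode.
    """
    v &= (1 << length) - 1          # assumed L-bit two's complement view of v
    g = 0
    for i in range(k):
        bit = 0
        for pos in range(i, length, k):
            bit ^= (v >> pos) & 1
        g |= bit << i
    return g


# A jump 7 instructions forward, folded into a 3-bit field from a 6-bit V.
reference_bits = compact(0b000111, 3, 6)
```

Interleaving the bits this way is equivalent to xor-ing together the consecutive K-bit slices of V, which is what the gate tree of a hardware implementation computes in parallel.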

Figure 2.2: Tree of xor gates for the compaction of the jump for a 3-bit field

Figure 2.2 shows an example of this computation applied to the case K = 3, V = 32. This expression of the reference bits highlights two qualities of the computation process, namely:

That a sequence of contiguous bits, all of them erroneous, with a length below 2K and affecting the value V is certain to be detected by the mechanism. Clearly, if a sequence of erroneous bits (of the value V) has a length below 2K, at least one of the computed bits g_i is affected by a single erroneous bit; and the computation with xor gates guarantees that, with a single erroneous bit, the result is certain to differ from the reference (in that particular bit).

That the number of reference bits, K, can easily be changed in order, for example, to increase the coverage against multiple errors, as explained above.
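The first property can be checked exhaustively for small parameters. The sketch below relies on the linearity of xor: flipping an error pattern E in V changes the reference bits by the compaction of E alone, so a burst goes undetected exactly when E compacts to zero. The function names are hypothetical and the slice-folding form of the compaction is used for brevity.

```python
def compact(v, k, length):
    """Xor together the consecutive k-bit slices of the length-bit value v."""
    v &= (1 << length) - 1
    g = 0
    while v:
        g ^= v & ((1 << k) - 1)
        v >>= k
    return g


def undetected_bursts(k, length, burst_len):
    """Start positions where flipping burst_len contiguous bits is missed."""
    pattern = (1 << burst_len) - 1
    return [start for start in range(length - burst_len + 1)
            if compact(pattern << start, k, length) == 0]


# With K = 3 and L = 32, look for undetected bursts shorter than 2K = 6 bits.
missed = [undetected_bursts(3, 32, b) for b in range(1, 6)]
```

A burst of exactly 2K bits touches every reference bit twice and cancels out, which is why the guarantee stops just below that length.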

Bear in mind that, although we can start from the generally accepted assumption that single errors (of one bit) are by far the most common, the fact is that the alteration of a single bit in one of the operands can give rise, through the computation of the jump destination performed by the processor, to a multiple error in the result. And it is from this result that the watchdog processor obtains the value V.

It is therefore necessary, in this particular case of processor jumps, to place special emphasis on the more than usual occurrence of errors of more than one bit, errors whose detection is essential in a system that aims to guarantee that the execution flow of the monitored processor remains intact.

2.3.3. Modifications to the processor and to the programs

Section 2.3 has already advanced some details of how the two instruction sequences that coexist in a system using the ISIS technique are kept isolated. This section reviews and completes those details to give a full description of the modifications that must be made, both to the processor and to the program it executes, to achieve that isolation.

It can already be said that some of these modifications, especially those concerning the insertion of instructions into the original program that the main processor has to execute, give rise to one of the causes of the performance loss analysed in chapter 13.


Preliminary clarifications

Before starting the detailed description of these modifications, some preliminary clarifications are needed.

Without loss of generality, the modifications and solutions applied to the architecture of the main processor and to the programs it executes are described using an example implementation with the following characteristics:

The size of the reference signature matches the size of an instruction and of a memory location. That is, when the instruction at address A is being executed and the next instruction is to be referenced, it is denoted A + 1.

The execution of a jump instruction does not imply the execution of the instruction that immediately follows it, whether the jump is taken or not, contrary to what happens in most current (pipelined) processors. That instruction, which would occupy the so-called delay slot, is typical of pipelined processors, since by the time the processor determines that the jump must be taken, the next instruction (the one occupying the delay slot) has already entered the pipeline.

This example implementation is intended to keep the reading from becoming too cumbersome (and, at this point, the author welcomes suggestions of any kind to make the text somewhat more pleasant and somewhat less dense than it is).

However, the modifications can be applied to completely different systems simply by making the following substitutions:

1. If we call W the space occupied by a signature (in memory locations), then wherever instruction addresses are mentioned, every reference to +1 must be replaced by +W.

2. If the system has a pipeline that produces the so-called delay slot, wherever the text mentions the CP register in relation to a jump or procedure call instruction, it must be understood as referring to the instruction in the corresponding delay slot.

3. In that same case, where the text states that, thanks to the modified behaviour of certain instructions, a gap is created to hold the reference signature in the memory location following the jump instruction, it must be understood that the gap is created in the memory location following the corresponding delay slot.

Modifications

First of all, placing the reference signature ahead of its associated block guarantees that, when the block is reached through a jump, the signature goes completely unnoticed by the main processor.

What remains to be solved, then, is how to keep the main processor from performing instruction fetches that retrieve signatures when the block is reached by other means. Four different cases can be distinguished:

1. Blocks that are reached by plain sequential execution when, at the end of a block finishing with a conditional jump instruction, the corresponding logical condition is not met and the processor therefore does not jump.

2. Blocks that are reached from a jump, when that jump is a procedure return. This case requires special treatment because, when the execution of the procedure ends and the return is executed, the address used to go back to the calling program was computed automatically by the processor at the time of the procedure call. Jumps to addresses computed at run time are not considered here when that computation is specified in some way as part of the program being executed and is therefore under the full control of the programmer.

3. Blocks that are reached by plain sequential execution of instructions with no jump instruction between them. These block changes, called fall-through, occur when, within a sequence of instructions without jumps, there is an instruction that is referenced from a jump instruction. The referenced instruction is known as a branch-in, and its existence requires the instruction sequence to be split into two sequential blocks, one ending with the instruction before the branch-in and another starting precisely with it.


4. Blocks that are artificially split into smaller blocks to fit their length to the corresponding field of the reference signatures. Although the cause of the split is radically different from the previous case, its treatment can well be assimilated to it, since the starting situation is the same: an instruction sequence must be split and there is no jump instruction, only an indication of where the split has to be made.

Each of these cases requires specific attention, which is detailed below.

For the first case, as already mentioned when presenting the ISIS proposal, the semantics of conditional jump instructions are altered so that a gap is generated in the sequence of addresses referenced by the processor when it executes a conditional jump and decides that the jump is not to be taken.

Figure 2.3: Example of the altered behaviour of a conditional instruction: (a) before the modification; (b) after the modification that creates a gap at address i+1

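This change implies modifying the behaviour of the processor from

\[
CP = \begin{cases} \text{destination} & \text{if the jump condition is true} \\ CP + 1 & \text{if the jump condition is false} \end{cases}
\]

which is the usual behaviour, to

\[
CP = \begin{cases} \text{destination} & \text{if the jump condition is true} \\ CP + 2 & \text{if the jump condition is false} \end{cases}
\]

where CP is the Program Counter register, which refers to the memory address of the current instruction (the jump instruction), and CP + 1 is the memory location following the current one.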

This simple change creates a gap, an unreferenced memory location, at address CP + 1, exactly where the signature of the next block has to be placed; after the modification, that block starts at address CP + 2.

Figure 2.3a shows a typical example of a conditional jump instruction (located at address i) used as part of a high-level if-then-else conditional statement. In figure 2.3b, after applying the modification, a gap is created at address i+1, preceding the else block of instructions which, with the modification, starts at address i + 2.
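The altered semantics of the conditional jump can be modelled with a one-line next-address function. This is only a behavioural sketch under the example assumptions of this section (signature size of one memory location, no delay slot); the function name and the `isis` flag are hypothetical.

```python
def next_pc_conditional(pc, target, taken, isis=True):
    """Next value of CP after a conditional jump located at address pc."""
    if taken:
        return target
    # Not taken: the modified processor skips the signature stored at pc + 1.
    return pc + 2 if isis else pc + 1


i = 100                      # address of the conditional jump
else_block = next_pc_conditional(i, target=200, taken=False)
```

In the not-taken case the modified processor never fetches from `i + 1`, so that location is free to hold the reference signature of the block starting at `i + 2`.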

For the second case, procedure returns, another semantic modification is made to the instruction set of the main processor, this time to the automatic computation of the return address when a procedure call is performed. The behaviour of procedure call instructions changes from

return = CP + 1
CP = destination

to

return = CP + 2
CP = destination

This creates, once again, a gap at CP + 1, this time after the procedure call instruction. That gap precedes exactly the block of instructions to be executed when the corresponding return takes place.
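A similar sketch models the modified procedure call. The explicit list used to hold return addresses is an assumption made for the example (an architecture may instead keep the return address in a link register), and the function names are hypothetical.

```python
def call(pc, target, return_addresses, isis=True):
    """Model a procedure call at address pc; returns the new value of CP."""
    # return = CP + 2 under ISIS, leaving CP + 1 free for a signature.
    return_addresses.append(pc + 2 if isis else pc + 1)
    return target


def ret(return_addresses):
    """Model the matching procedure return; returns the new value of CP."""
    return return_addresses.pop()


saved = []
cp = call(40, target=500, return_addresses=saved)
cp_after_return = ret(saved)
```

The return instruction itself is unchanged: it simply uses the address saved by the call, which now points past the signature of the block that follows the call.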

For the last two cases, in which no jump instruction separates the two blocks, the only solution is to modify the original program and insert a jump instruction at the exact point where the split between blocks is wanted. Since the jump is introduced solely to make the processor "dodge" the signature that has to be placed between the two blocks, the jump instruction does not follow from the original logic of the program. Therefore, i) an unconditional jump instruction to address CP + 2 is used, and ii) this instruction must be regarded as a necessary overhead of the system, both in memory space and in execution time.

Figure 2.4: Example of an if-then-else statement: (a) high level; (b) blocks before the insertion of signatures; (c) after the insertion of signatures and jump instructions

Figure 2.4 shows an example of these modifications on a simple high-level if-then-else construct. Figure 2.4b shows the sequential blocks of instructions. The blocks are kept apart to improve legibility, although their instructions follow one another without interruption. A column with some of the instruction addresses has also been added on the right to make that continuity evident.

Figure 2.4c shows the program after being instrumented; the dotted lines indicate the behaviour a typical processor would have on a conditional jump instruction. The figure shows, on the one hand, that each block is preceded by its corresponding signature (in light grey in the figure). On the other hand, the fall-through transition between the last two blocks has been replaced by the forced insertion of an unconditional jump instruction (in dark grey in the figure, at address k+1) that makes the processor jump over the reference signature (stored at address k + 2 in the figure) of the last block, which starts at address k + 3 in the figure. Since the inserted instruction is executed by the main processor, it is one more instruction of the corresponding block.

2.3.4. Treatment of jumps

ISIS extends the number of scenarios covered by the error detection mechanisms with respect to previous proposals. However, not all jumps can be verified at run time.

As an initial premise, in order to compute the reference bits before execution, the destination or destinations of the jump instruction must be known. This premise accompanies all techniques that verify the structural integrity of a program, including the technique proposed here.

Assuming that the premise holds, ISIS allows some jumps with multiple destinations to be verified by checking, with the reference signature of the destination block, that the origin is correct. Postponing the verification until the signature of the destination block is obtained removes, in a simple way, the main obstacle that traditionally prevented a set of reference jump addresses from being verified.

However, not all jumps can be verified. To understand the scenarios that can arise in an application, and which of them can be verified with the mechanisms incorporated in ISIS, a small taxonomy of jump types is needed:

1. Simple unconditional jump. A jump that is always taken and has a single destination. This type also includes subroutine call or procedure jump instructions.

2. Multiple unconditional jump. A jump that is always taken and has a finite and known set of possible destinations, one of which is selected at run time. This case includes procedure returns. Strictly speaking, and although they are instructions that break the execution sequence, procedure return instructions are not classified as jump instructions. However, for the purpose of developing the verification mechanisms of a watchdog processor, any break in sequence can be treated as a jump.


3. Simple conditional jump. A jump that may or may not be taken but that, if taken, has a single possible and known destination. Although a conditional procedure call could be considered feasible for the verification mechanisms of ISIS, this possibility has not been taken into account in the development because it is not part of the instruction set of the MIPS architecture, the basis for the practical part of this work (see chapter 3).

4. Multiple conditional jump. A conditional jump that, if taken, has a finite and known set of possible destinations, one of which is selected at run time. The possibility of a conditional return instruction has not been considered in this case either, although, as in the previous case, if an architecture supports it and the corresponding software support is available, the verification mechanisms incorporated in ISIS could handle these jumps naturally.

and another taxonomy of the types of destination that can be reached with a jump:

1. Branch-in reached from one or more simple jump instructions.

2. Branch-in reached from a single multiple jump instruction.

3. Branch-in reached from one or more simple jump instructions and a single multiple jump instruction.

4. Branch-in reached from two or more multiple jump instructions.

5. Branch-in reached from one or more simple jump instructions and two or more multiple jump instructions.

In this last classification, and for the sake of simplicity, the type of jump (conditional or unconditional) has been deliberately omitted because both receive the same treatment.

The first case is the only one that, for the most part, was handled until now by the proposals found in the literature. Its treatment in ISIS is similar to those proposals: the reference signature of the origin block (where the jump instruction is located) includes, in the destination address field, the reference bits to verify the jump from the origin, and the type indicates that verification; in the signature of the destination block the content of the origin address field is irrelevant, because the signature type indicates that the origin is not to be verified.

The second case is handled in ISIS by delaying the verification until the watchdog processor has the reference signature of the destination block. To this end, in the reference signature of the origin block the content of the destination address field is irrelevant and unused, which is indicated by the corresponding type; in the signature of the destination block the origin address field contains the reference bits of the jump, and the signature type tells the watchdog processor that it must verify that the origin of the jump is correct.

In the third case, a compromise between functionality and detection coverage must be reached so that the blocks containing the simple jumps are verified with the signature of the origin block while, when the jump comes from the block with the multiple jump, the verification is performed with the signature of the destination block.

When the jump comes from the block with the multiple jump, this verification can only be performed, as in the previous case, once the destination block has been reached and with its signature. This forces the destination block to indicate, through the corresponding bits of the type field, that the origin must be verified (and that origin must be the block with the multiple jump).

However, when the jump comes from one of the blocks with a simple jump, the verification is performed with the signature of the origin block; yet when the watchdog processor obtains the signature of the destination block, it finds that its type indicates that the origin of the jump must be verified with the origin address bits, which is an inconsistency.

At this point, two different paths could have been taken regarding the specification of the behaviour of the watchdog processor, namely: a) not to consider this scenario possible, so that, should it arise, the signature sequence would indicate an error in the execution flow, or b) to include this scenario among the possible ones, making the watchdog processor accept this inconsistency in the signature sequence as valid.

The second option has been chosen: greater functionality, or a larger number of treatable cases, at the cost of reducing the error detection capability. Indeed, as indicated in section 2.3.2, the watchdog processor has a mechanism to detect errors in the execution flow based on the sequence of reference signature types of the blocks executed by the main processor; this mechanism detects incorrect jumps by taking into account that not all combinations of signature types in an execution sequence are valid. Even though the signature sequence needed to bring this case into the set of handled scenarios can produce one of those invalid sequences, this case has been removed from the error detection mechanism, so that it becomes one more valid sequence.

The fourth case, a block reached from two or more multiple jump instructions, cannot be verified with ISIS. For the verification to take place, the destination block would need not one origin address field but one for each of the multiple-jump blocks that reference it as one of their possible destinations, which is not possible. What ISIS does allow is the existence of this case, even though the jump cannot be verified. To this end, the origin blocks (with a multiple jump) do not indicate verification of the destination at the origin (since, being a multiple jump, it would require a set of destination address fields), and the destination block does not indicate verification of the origin (which would require a set of origin address fields).

Nor can the jump be verified from the multiple-jump blocks of the fifth and last scenario, only from those blocks with a simple jump. For the latter, the verification is performed with the signature of the origin block and its destination address field; for the former, the only possible treatment is that of the previous case.

It might seem, from the description of how the last cases are treated, that checking for invalid sequences in the signatures of two consecutive blocks is not possible, since the fact that an origin block does or does not indicate verification with the destination address field does not imply a specific verification value at the destination.

This is indeed so, but the block type carries, in addition to the indication of whether or not the jump is verified with the destination address field, an indication of the block type that the watchdog processor should find at the destination. This indication makes the watchdog processor require (or not), from the reference signature of the destination block, the verification of the origin of the jump.

Table 2.1: Jump cases

Signature type
Origin block                       Destination block
Verify         Verify origin       Verify
destination    (at destination)    origin              Treatment
-----------    ----------------    -----------------   ---------
no             no                  no                  Cases 4 or 5; in the latter, the destination block is reached from one of the multiple-jump blocks
no             no                  yes                 Error
no             yes                 no                  Error
no             yes                 yes                 Case 2, or case 3 if the destination block is reached from the only multiple-jump block
yes            no                  no                  Cases 1 or 5; in the latter, the destination block is reached from one of the simple-jump blocks
yes            no                  yes                 Case 3; the destination block is reached from one of the simple-jump blocks, and the watchdog processor, despite what the signature of the destination block indicates, does not verify that the origin is correct
yes            yes                 yes                 Error
yes            yes                 no                  Error

The case analysis, and the treatment that the signature-sequence error-detection mechanism gives to each case, is shown in Table 2.1, which lists every possible combination of two consecutive signatures: on the left, the signature of the origin block, and in the centre, the signature of the destination block.

In that table, the first column indicates whether the jump from the origin block is single-target or multi-way: a single-target jump is marked "yes" in the first column, since the jump destination must be checked with the origin block's signature, and a multi-way jump is marked "no". In the second column, the indication of the kind of origin verification that should be found in the destination block's signature is what allows sequence errors to be detected, whenever the destination block indicates otherwise.

As the table also shows, to resolve the inconsistency of the third scenario, the watchdog processor does not verify the origin of the jump on reaching a destination block if the jump has already been verified as correct using the origin block's signature, regardless of what the destination block's reference signature indicates.
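The decision table can be condensed into a few lines of code. The following Python sketch is only illustrative and is not the watchdog processor's actual logic: the function name, the argument names and the returned pair are assumptions introduced here. The three booleans correspond to the three yes/no columns of Table 2.1.

```python
def check_signature_sequence(verify_dest, expected_verify_origin, dest_verify_origin):
    """Classify two consecutive signatures as in Table 2.1 (hypothetical helper).

    verify_dest:            origin signature asks to verify the destination
                            (single-target jump).
    expected_verify_origin: what the origin signature says the destination
                            signature should request.
    dest_verify_origin:     what the destination signature actually requests.

    Returns (valid, check_origin): whether the sequence is valid, and whether
    the origin of the jump is verified on arrival at the destination block.
    """
    if verify_dest:
        # Single-target jump: the destination is checked from the origin side,
        # so an origin that also expects an origin check is an invalid sequence.
        if expected_verify_origin:
            return (False, False)
        # Valid. The origin check is skipped even if the destination signature
        # asks for it (third scenario).
        return (True, False)
    # Multi-way jump: the destination must match what the origin announced.
    if expected_verify_origin != dest_verify_origin:
        return (False, False)
    return (True, dest_verify_origin)
```

Under these assumptions, a multi-way jump is only checked on arrival (`check_origin` is true), while a single-target jump is checked from the origin block and never re-checked at the destination.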

2.4. Software support

One of the most important aspects lending credibility to a proposal of this kind is the development of the software support needed to insert the reference signatures.

For this practical development, the freely distributed compiler gcc, version 2.95, was chosen, together with its accompanying set of programs and libraries, binutils, version 2.9.5.

The choice of this compiler as a starting point is not accidental. On the one hand, it is one of the few open-source compilers; this, together with the fact that it is developed and improved by a good number of volunteers around the globe, makes it relatively well documented software. Also worth mentioning is the support these people offer, entirely altruistically, to those who, like this author, peer into the vast amount of source code it comprises.


On the other hand, its internal structure makes it relatively simple to port the compiler to new architectures, or to new processor models of existing architectures, and to integrate them uniformly.

It should also be stressed that, with the source code of this compiler, it is easy to build a cross compiler: a compiler whose programs run on a host architecture but which generates binary code executable on a second architecture, the target¹. In particular, among the many supported architectures, there is support for building a cross compiler with MIPS as the target. This is the architecture of the processor used in the practical part of this work (see chapter 3).

Finally, it must also be stressed that gcc is not a purely academic tool: it ranks, with very good results, among the most widely used compilers in industry, and most especially in the field of embedded systems.

Taken as a whole, it is a well-structured, relatively well documented tool designed precisely so that adding new architectures or processor models is not an impossible task. But the reader should make no mistake: it is a tool that comprises a vast amount of source code and that, to be grasped in its entirety, would need a sizeable team. Fortunately for the author of this work, the structure of the compiler has allowed the modifications needed to generate the reference signatures to be confined to a few modules.

2.4.1. Internal structure of gcc

It is not this author's intention to dissect here, one by one, each and every element and program that makes up the gcc compiler. It is important, however, to describe briefly the most important elements that have been modified as a result of this work to achieve the automatic generation of reference signatures for the watchdog processor.

¹ Although not relevant to this work, one last possibility of the gcc compiler is worth mentioning here: the Canadian cross, which involves three different architectures. It is a compiler that is built (compiled) on one host, runs on a second host, and generates binary code for a third architecture, the final system or target.

First, the code of this compiler can be described as a set of three elements:

The standard support library for the C language, libc.a, which contains the code of the routines needed by any program running in an environment with an operating system, especially as regards input/output. For embedded systems it can be replaced by another library, called newlib, which places fewer requirements on the target operating system. The latter can also be used to develop programs for bare systems, with no operating system.

The compiler proper, which generates assembly code and relies on the libc.a library. It is really a set of programs, the first of which is the preprocessor, cpp. gcc supports several high-level languages, although for the present work its use has been limited to the C language.

The set of utilities known as binutils. Among them are the assembler gas, the linker/loader ld, the debugger/simulator gdb, etc. All these utilities rest on a common base that supports the binary format of the generated files, both the object files produced by assembly and the executables produced by the linker/loader. This common library is called bfd and, depending on the architecture of the target system, can offer support for the a.out, COFF (Common Object File Format) and ELF (Executable and Linkable Format) formats.

What is of vital importance for instrumenting programs with reference signatures is that the compiler translates high-level statements (in whatever language) into assembly, generating all the information for the assembler as text and using labels and names (symbols), and that assembly and linking are separate from compilation.

That is, instrumenting programs does not require modifying the compiler, only the tools that support it, the binutils.


And, when the original program specifies a jump to a given instruction, the compiler generates an assembly jump instruction whose destination is given by a label, a name (possibly generated automatically); even if the set of intervening memory locations changes (for example, by adding the reference signatures), keeping the same name or label is enough for the jump to land on the correct instruction.

It must be stressed at this point that symbolic references are kept throughout the whole process, until the very last moment if necessary, thanks to fix-up records, described in section 2.4.2. Keeping these symbolic references is essential to support building programs from several object files and libraries, since the addresses of external symbols are not known during assembly. That address may not be determined until link time (the symbol is then said to be resolved), at which point the symbolic reference can be replaced by the numeric one.

2.4.2. Most relevant internal elements of the binutils

Describing in detail the internal structure of object files, and the functionality offered by the set of tools known as binutils, is beyond the scope of this work.

That said, it would not be possible to describe how the signatures are generated automatically, or the effect that compilation and assembly have on them, without describing, if only briefly, some of the internal elements. These elements are not used exclusively for signature generation, nor were they created for that purpose. They were already part of the binutils for other reasons; what has been done is to exploit them, in one case, or take them into account, in the other, so that signature generation is possible.

Fix-up records

A fix-up record is an element of the internal binary format used by the binutils tools to keep a symbolic reference. It is associated with one particular instruction, and it holds not only the information about the symbol being referenced but also an indicator: an index into a table of calculation procedures.

Bear in mind that, by the time the referenced symbol can be resolved and replaced by its address, in many cases that address cannot simply be dropped into the program; it must be processed, manipulated and inserted as one more field of a particular instruction, and the insertion must not modify the other fields of that instruction.

For example, in the MIPS architecture, in which all instructions are 32 bits long, conditional branches specify the branch destination as the displacement (signed and counted in 32-bit instructions) between the branch instruction and the destination; this displacement is placed in the low part of the branch instruction as a 16-bit field. Thus, a conditional branch instruction that makes a symbolic reference to its destination has, from the moment it is generated, the (upper) half of its 32 bits already permanently encoded.

When the referenced symbol can be resolved, the linker needs to determine how the address of that symbol fits into the instruction to which the fix-up record refers. For the instruction in the example, a numeric index lets the linker index a table of procedures for calculating and modifying instructions, and it is this procedure that computes the displacement and modifies (the low part of) the instruction in question.
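As an illustration of the branch example, the sketch below patches the 16-bit displacement into a MIPS conditional branch through an indexed table of procedures. It is written in Python for brevity and does not reproduce the binutils code: `patch_branch16`, `FIXUP_PROCEDURES` and `apply_fixup` are hypothetical names, and index 0 is an arbitrary choice. The sketch assumes the usual MIPS convention that the displacement is measured from the instruction following the branch.

```python
def patch_branch16(insn, insn_addr, target_addr):
    """Insert the signed word displacement into the low 16 bits of a branch."""
    delta = target_addr - (insn_addr + 4)   # relative to the next instruction
    if delta % 4 != 0:
        raise ValueError("branch target is not word aligned")
    offset = delta // 4                     # counted in 32-bit instructions
    if not -0x8000 <= offset <= 0x7FFF:
        raise OverflowError("displacement does not fit in 16 bits")
    # Keep the upper half (opcode and registers), replace only the low half.
    return (insn & 0xFFFF0000) | (offset & 0xFFFF)


# Table of calculation procedures; the fix-up record stores only the index.
FIXUP_PROCEDURES = {0: patch_branch16}


def apply_fixup(index, insn, insn_addr, target_addr):
    """Resolve one fix-up record once the symbol address is known."""
    return FIXUP_PROCEDURES[index](insn, insn_addr, target_addr)
```

For instance, `apply_fixup(0, 0x10000000, 0x100, 0x110)` fills in a forward displacement of three instructions and leaves the upper half of the instruction word untouched.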

Keeping the resolution of symbols into addresses separate from the procedures that manipulate those addresses and insert them into instructions makes it very easy to add new kinds of calculation: it suffices to create a new entry in the table of calculation procedures and assign it an index.

From then on, any of the binutils tools can insert or resolve this new type of fix-up record. What signature generation requires is precisely a new calculation procedure for jumps: the generation of the bits in the origin address and destination address fields of the reference signatures.

Clearly, each architecture supported by the binutils has a different set of fix-up records available. Therefore, for the ISIS technique to be ported to another architecture, the fix-up records created for signature generation would have to be replicated. Those already created could only be reused if the calculation procedure stays the same (which it does) and the address fields keep the same length and the same position within the reference signatures.

Variant frags

A variant frag is a code fragment of variable length. The variability stems from the need to generate optimal code combined with not knowing, at the precise moment the code is generated, which sequence is optimal for the particular case.

One could take the most conservative route and generate the sequence that can always be used, at the cost of being inefficient in most cases, where the task could be done with fewer instructions.

The aim of a variant frag is to delay the choice of the optimal instruction sequence until all the information needed to make a reasoned decision is available. Put that way, it might seem that part of the assembly process (the generation of some instructions) has to be delayed until the necessary information is available, perhaps until link time. But that is inconsistent with a layered structure in which each tool handles one part of the work.

How is this dilemma resolved? By generating the different possible alternative instruction sequences and recording, for each one, what it requires and how many instructions it involves. In this way, instruction generation remains confined to the assembler, while the choice of the optimal sequence can be delayed. This is possible because, generally, the sequence considered optimal is the one that needs the fewest instructions.

Let us take an example, this time using a MIPS memory load instruction. To reference a memory location, this architecture uses what is known as indexed addressing: a displacement (signed and at most 16 bits wide) from an address called the base (held in a base register). Both the displacement and the base register are stated explicitly in the memory access instruction. The problem for the assembler is knowing, at the time it generates the memory access instruction sequence, whether a base register is available and loaded with a base address such that the data can be reached with a 16-bit displacement.

If the access is to a local variable of the function being compiled, the base register exists, be it the stack pointer or the frame pointer, and the assembler has all the information it needs to generate, usually, a single instruction (the memory load) to access the data. If the displacement cannot be encoded in 16 bits, the assembler generates a different, alternative sequence that loads a register with the address of the data in memory and then uses that register as the base register in the memory load instruction (with zero displacement). Obviously, this latter sequence requires more instructions, and the assembler, for efficiency, tries to avoid it as far as possible.

However, if the data to be loaded from memory is referenced symbolically, by a name the assembler cannot resolve at that point, what can be done? The assembler creates a variant frag with both instruction sequences, each with the fix-up records needed to complete it.

When the symbol can be resolved, the calculations of each fix-up record are performed, starting with the shorter sequence. If the calculation does not fail, the longer sequence is discarded. If the calculation fails, because the displacement cannot be encoded in the corresponding field of the memory load instruction, the shorter sequence is discarded and the longer one is used.
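The selection rule just described, shortest alternative first with a fallback to the longer one, can be sketched as follows. This is a simplified Python model and not the data structure used by the assembler: the `Alternative` class, its fields and the instruction count of three for the long sequence are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alternative:
    name: str
    n_instructions: int
    fixup_ok: Callable[[int], bool]   # does the fix-up calculation succeed?


def fits_signed16(displacement):
    """True if the displacement can be encoded in a signed 16-bit field."""
    return -0x8000 <= displacement <= 0x7FFF


def resolve_variant_frag(alternatives: List[Alternative], displacement: int) -> Alternative:
    """Try the alternatives from shortest to longest; keep the first that works."""
    for alt in sorted(alternatives, key=lambda a: a.n_instructions):
        if alt.fixup_ok(displacement):
            return alt
    raise ValueError("no alternative can encode the reference")


# Hypothetical variant frag for a memory load with a symbolic address.
LOAD_FRAG = [
    Alternative("long: build the address in a register, then load", 3, lambda d: True),
    Alternative("short: single load with a 16-bit displacement", 1, fits_signed16),
]
```

With this model, `resolve_variant_frag(LOAD_FRAG, 100)` keeps the one-instruction sequence, whereas a displacement such as `0x12345` forces the longer one.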

In this way, instruction generation remains the exclusive responsibility of the assembler, and the instruction sequence used for each task is always the optimal one, even though the assembler does not have all the information that the correct choice requires.

Variant frags do not help signature generation; on the contrary, their use makes it harder.

Indeed, a block of instructions that includes a variant frag means, for example, that the number of instructions in the block cannot be known for certain. At the same time, a block with many instructions must be split into several shorter ones to fit what the length field of the reference signatures allows.

Here the simplest, though less efficient, solution has been adopted: for each variant frag, the longest sequence is taken into account when deciding whether a block needs to be subdivided. The downside is that, when the shorter sequence of the variant frag is finally chosen, the total length of the block may be below the maximum; had this been known, it might not have been necessary to split the block in two, a split that adds overhead in both performance and memory consumption.
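The conservative splitting policy can be illustrated with a short sketch. It is a Python model under stated assumptions: each item of a block is represented as a pair (shortest length, longest length) in instructions, a fixed instruction being (1, 1), and the maximum block length of 8 used in the example is arbitrary, since the real limit depends on the width of the length field.

```python
def split_block(items, max_len):
    """Split a block so that its worst-case length never exceeds max_len.

    items: list of (short_len, long_len) pairs, one per instruction or
    variant frag. Only long_len is used to decide where to split.
    """
    blocks, current, worst = [], [], 0
    for short_len, long_len in items:
        if long_len > max_len:
            raise ValueError("a single item exceeds the maximum block length")
        if current and worst + long_len > max_len:
            blocks.append(current)          # close the block before overflowing
            current, worst = [], 0
        current.append((short_len, long_len))
        worst += long_len
    if current:
        blocks.append(current)
    return blocks


# Six plain instructions, one variant frag (1 or 3 instructions), one more.
example = [(1, 1)] * 6 + [(1, 3), (1, 1)]
blocks = split_block(example, max_len=8)
best_case_total = sum(short for short, _ in example)
```

Here the block is split in two because its worst-case length is ten instructions, although `best_case_total` is eight: had the short sequence been known to apply, a single block would have sufficed. This is the overhead the text refers to.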

2.4.3. Automatic insertion of the reference signatures

As mentioned earlier, this practical development uses version 2.95 of the gcc compiler and modifies its accompanying tools, the binutils, version 2.9.5. The starting point is this software built as a cross compiler, with the i386 architecture running the Linux operating system as host and the MIPS architecture on a bare system (no operating system) as target.

Since this development is simply a proof that automatic generation of the signatures required by the ISIS technique is viable, generation has been restricted to the ELF (Executable and Linkable Format) binary format in its big-endian version.

Inserting the reference signatures is not complex, thanks to the fix-up records. In fact, when a signature is inserted ahead of a block of instructions, practically all its fields are left empty, waiting for the signature to be "patched" with the necessary fix-up records.

Indeed, in many cases the block length cannot be fixed definitively, because of variant frags. The bits of the origin address and destination address fields require the corresponding symbols to be resolved, and as for the check signature over the binary patterns of the instructions, it obviously cannot be computed until every bit of every instruction has been determined; this includes resolving any fix-up records and variant frags present.

This dependency has forced the creation of a two-pass linker. The first pass resolves all the symbols of the program except those relating to the reference signatures; considerable effort goes into linker development so that it can do its job in a single pass over the code. It is in the second pass, with all instructions in their final form, that all the fields of the reference signatures can be completed.

For the fix-ups corresponding to the destination address of jumps, both conditional and unconditional, the same symbol referenced in the jump instruction itself is used. For jumps with multiple destinations, the assembler must know, at the time the instruction is inserted, about the special nature of the jump and about all its possible destinations.

To avoid modifying the high-level compilation stage, which would have to produce that information either automatically or with the programmer's help, the only jumps of this kind included in the software support are procedure returns. These multi-way jumps need no specific information from the compiler, which is why they have been included in the software support.

During assembly, when the last instructions of a procedure are emitted into the object code, the modified assembler automatically generates, specifically for this purpose, a symbol that includes the name of the procedure. This symbol is used as the reference for encoding the origin address field in the reference signature of the block following the one containing the procedure call instruction (the block executed after the return).

For the end of the procedure to be usable as a reference address, each procedure or function must have a single exit point. This is an essential requirement, since the ISIS technique assumes that, once a procedure is executed, there is a single instruction that returns from it.

This requirement is not new, and in fact it has not been necessary to enforce it through the modification described here, since it was already built into the assembler code for generating the function epilogue. This epilogue includes the code that restores the registers modified by the function, releases the stack space of the local variables, etc. For simple economy, the epilogue should not be repeated even if, in the high-level source text, a function can return from several points in the code; what the compiler does is redirect those returns to a common epilogue from which the return is actually performed.

However, when programming directly in assembly, the programmer must be aware of this requirement. Otherwise, the watchdog processor will flag as errors returns that the main processor executed correctly but that were made from points of the function other than its last instructions.

It perhaps remains to point out that the length and derived signature fields need no fix-up of any kind: exploring and processing the code up to the next reference signature is enough to determine their contents.
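A possible shape for that final scan is sketched below in Python. Everything in it is an assumption made for illustration: the representation of the code as a list of dictionaries, the field names, and above all the use of a plain XOR of the instruction words as a placeholder for the derived signature, since the actual check function is defined elsewhere in this work.

```python
def fill_signatures(stream):
    """Fill the 'length' and 'checksum' fields of every signature in place.

    stream: list of items, each either {'kind': 'sig'} or
    {'kind': 'insn', 'word': <32-bit int>}. Instructions that appear before
    the first signature are left unaccounted for.
    """
    current = None
    for item in stream:
        if item["kind"] == "sig":
            current = item                  # a new block starts here
            current["length"] = 0
            current["checksum"] = 0
        elif current is not None:
            current["length"] += 1
            current["checksum"] ^= item["word"]   # placeholder check function
    return stream


program = [
    {"kind": "sig"},
    {"kind": "insn", "word": 0x8FBF0010},
    {"kind": "insn", "word": 0x03E00008},
    {"kind": "sig"},
    {"kind": "insn", "word": 0x00000000},
]
fill_signatures(program)
```

The point of the sketch is only that both fields depend on nothing but the final instruction words of the block, so a single linear pass suffices and no symbol resolution is involved.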

2.4.4. Practical use of the compiler

From the source code of the modified binutils, two sets of support libraries must be built. One of them has no signatures or inserted instructions and is used when standard MIPS code is wanted.

The second set of support libraries includes the reference signatures of the blocks, plus the fix-up records needed to patch the signatures when the executable code is generated.

User programs must be compiled against this second set of libraries if the program as a whole is to incorporate the ISIS reference signatures.

The choice of which libraries are used, and of whether or not the reference signatures are to be inserted, is made through options (switches) on the command line at compile/link time.

The ability to compile code without the reference signatures means this compiler can generate standard MIPS code; this can be exploited to introduce the signatures into a system's code progressively. As chapter 3 will show, the processor implemented to realize the ISIS proposal allows it to be indicated, at run time, whether or not the running task carries reference signatures and, therefore, whether or not the watchdog processor can verify its execution.

When compiling with reference-signature insertion, the process prints informational messages indicating, for example, how many blocks the program originally had, how many have been split, how many instructions and signatures have been inserted into the program, etc. This information will be crucial for analysing the memory overhead that the use of ISIS imposes.

2.5. Conclusions

This chapter has analysed the weak points of existing watchdog processor proposals. From the conclusions of that analysis arises the proposal of a new technique for embedding reference signatures, named ISIS (Interleaved Signature Instruction Stream). With this proposal there is no need to modify the instruction set of the main processor, nor are serious performance penalties incurred, because the main processor neither references nor executes the signatures.

To isolate the reference signatures from the main processor's instructions while keeping the signatures interleaved with that code, some simple modifications must be made to the semantics, though not to the encoding, of some instructions. Specifically, these semantic changes affect conditional branch instructions and subroutine call instructions.

This isolation between the main processor and the reference signatures aims to reduce the performance loss incurred by earlier watchdog processor proposals. Sometimes, however, the original program must be altered to insert jump instructions. These instructions not only increase memory requirements; they also affect the performance of the resulting system.

The ISIS technique requires no change to the instruction set of the monitored architecture and no special feature from it, which allows it to be applied to any architecture.

On top of the signature embedding technique, a reference signature is developed that incorporates a set of error-detection mechanisms; these are based, fundamentally, on check bits both for the instructions that make up the block and for the jumps between blocks.

Implementing one of the error-detection mechanisms described, the one called block start, is not always possible. For it to be feasible, the encoding of the reference signatures must be completely distinct from the instructions of the main processor; since one of the aims of ISIS is to apply to as many architectures as possible, this encoding, which is not always achievable because it depends on unused opcodes in the instruction set, is a recommendation rather than an imposed requirement. If such an encoding is not possible, the ISIS signature embedding technique can still be used, although the reference signatures will lose one of the mechanisms for detecting errors in the processor's jumps.

Among these mechanisms, the verification of a multi-way jump after the jump, using the reference bits of the destination block, stands out for its originality. It is an original solution to the problem of verifying jumps with multiple destinations, and it allows jump scenarios to be verified that had not been considered before.

Some jump cases or scenarios still cannot be verified: namely, a block of instructions reached from several multi-way jump instructions. In other words, a block that is a destination shared by several such jumps.

Finally, as a practical demonstration of the viability of the proposal, it has been described how, starting from an open-source compiler widely used in both academia and industry, and through a series of modifications, the reference signatures are generated automatically.

With this implementation of the support software, programs written in C can be compiled and are then instrumented automatically during assembly; in general, the information needed to complete the encoding of the reference signatures is not available until the executable program is generated, which is when the signatures are finally encoded.

Although the software support makes it possible to cover all the proposed jump scenarios, to avoid modifying the high-level compilation stage the only multi-way jumps actually supported are procedure returns. Any other multi-way jump would require either the user or the high-level compiler to tell the assembler which jumps have this characteristic and what their possible destinations are.


What remains to be done is, first, to demonstrate in practice that the proposed semantic alterations to the instructions are feasible and that a system so modified is viable. Second, to demonstrate that the error-detection mechanisms incorporated are effective. And, finally, to determine the magnitude of the overhead, in both memory consumption and performance, incurred by using this technique.

Chapter 3 is devoted to the first of these tasks: it describes the development of a processor, a model of the MIPS architecture written in the hardware description language VHDL, equipped with a watchdog processor and modified to use the ISIS technique.


Chapter 3. HORUS: Implementation of the ISIS technique

To succeed in the commodity market, fault-tolerant techniques need to be sought which will be transparent to end users.

D. P. Siewiorek [7]

3.1. Introduction

Chapter 2 proposed a technique for embedding the reference signatures of a watchdog processor by interleaving them between the instruction blocks of the main processor.

To minimize the performance loss incurred by monitoring, the ISIS technique proposes completely isolating the main processor's instruction sequences from the reference signatures. To achieve this isolation, a series of modifications must be made to the processor's behaviour as regards the handling of conditional branches and procedure calls.

This chapter presents an implementation of that proposal on a RISC processor, based on the MIPS architecture, for which software support is available for computing and automatically generating the reference signatures.

The main goal of this development is to demonstrate that the proposed modifications are viable. It must also serve as a workbench for obtaining performance figures on sample programs.

This processor has been named HORUS¹. It is a model written in the hardware description language VHDL, using a subset of the language that makes the result synthesizable, ready to be downloaded to an FPGA-type programmable logic device.

3.2. Test bench

To carry out the tests needed to verify the operation of HORUS, a test bench must be built that includes, at least, the system under verification together with the elements needed to support its execution of programs. Namely: the memory holding both the instructions of the running program and the data it handles, and the clock system that marks the passage of time for the whole system.

The models developed for these elements are not synthesizable, nor do they need to be. Their purpose is to give the HORUS model the support elements it needs to execute a program, and thus to verify that it operates correctly.

In particular, the VHDL model developed for the system memory allows the name of a text file to be specified at simulation time; this file contains, in a standard format, the initial contents of the memory before the processor starts executing the program. In this way, the memory can model a ROM holding the instructions of the program to be executed, whose contents have been programmed prior to execution, as is typically the case in embedded systems.

¹ Since this processor derives from the ISIS technique (it could be said to be its offspring), whose name coincides with that of an ancient Egyptian deity, the name of the processor is meant to follow the genealogy of ancient Egyptian mythology, in which HORUS is the son of the gods ISIS and OSIRIS.


The file name is given through what VHDL calls a generic. A VHDL generic is a constant used during the compilation and/or simulation of a VHDL model that allows the internal structure or the logic of the model to be modified according to its value. A prototypical example is specifying the number of bits of a register by means of a generic; the generic makes the register parameterizable, and its effective width is set when the register is incorporated into a larger model (it is then said to be instantiated).

The generic that names the file with the initial memory contents is called filename. The format chosen to represent those contents is the standard Intel hexadecimal (Intel HEX) format, commonly used to represent memory contents (especially when working with ROM chip programmers). It is a line-oriented format in which addresses and contents are represented in hexadecimal as readable ASCII text. This keeps the development of a translator in VHDL (or in any high-level programming language) from becoming too cumbersome because, besides handling plain text, the number of distinct record types (distinct kinds of text lines) to process is small.
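As a rough illustration of why a translator for this format stays small, the following Python sketch parses Intel HEX text into a sparse memory image. It is not the VHDL reader used in the test bench: the function name parse_intel_hex, the dictionary representation and the decision to ignore start-address records are assumptions made only for this example.

```python
def parse_intel_hex(text):
    """Return a dict mapping absolute byte addresses to byte values."""
    memory = {}
    upper = 0  # upper address bits, set by record types 02 and 04
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if not line.startswith(":"):
            raise ValueError(f"line {line_no}: missing ':' start code")
        raw = bytes.fromhex(line[1:])
        if len(raw) < 5:
            raise ValueError(f"line {line_no}: record too short")
        # All bytes of a record, checksum included, must add up to 0 modulo 256.
        if sum(raw) & 0xFF != 0:
            raise ValueError(f"line {line_no}: bad checksum")
        count = raw[0]
        offset = (raw[1] << 8) | raw[2]
        record_type = raw[3]
        data = raw[4:-1]
        if len(data) != count:
            raise ValueError(f"line {line_no}: length mismatch")
        if record_type == 0x00:      # data record
            for i, byte in enumerate(data):
                memory[upper + offset + i] = byte
        elif record_type == 0x01:    # end-of-file record
            break
        elif record_type == 0x02:    # extended segment address
            upper = int.from_bytes(data, "big") << 4
        elif record_type == 0x04:    # extended linear address
            upper = int.from_bytes(data, "big") << 16
        # Record types 03 and 05 (start addresses) are ignored in this sketch.
    return memory


# Hand-built sample: one extended linear address record, one data record, EOF.
sample = ":020000040001F9\n:040000003C080001B7\n:00000001FF\n"
image = parse_intel_hex(sample)
```

Only four record types need handling to load a program image, which matches the remark above about the small number of distinct line types.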

Besides its simplicity, this format was chosen because it is among the formats supported by the binutils tools that accompany the gcc compiler, which can convert an object file from one format to another.

Thus, after linking the program, which produces a binary file in the default format (elf), it suffices to invoke the objcopy utility to convert the executable to Intel hexadecimal format. Simulating the execution of the program then also makes it possible to verify that signature insertion has been performed correctly, thereby validating the modifications made to the standard compiler.

Also to ease verification of the model, a trace system has been included. It is connected directly to the processor pipeline and reports the most relevant information about the instruction occupying each stage. Since this debugging element is not synthesizable but is connected directly to the main processor, its position in the hierarchical organization of the model is rather peculiar: being non-synthesizable, it should lie outside the boundaries of the model that represents the programmable logic device (FPGA or similar).


To resolve this issue, the system model uses another generic, named debug, that determines whether or not the trace system is present in the model. It is useful to have it present when the system is simulated, since it helps verify its operation, but it obviously cannot be present when the model is to be synthesized for download onto a real chip, because its code is not synthesizable.

Finally, a third generic called isis makes it possible to remove every trace of the watchdog processor from the system: the watchdog processor itself, the second read port of the instruction cache, the connections with the main processor, the associated registers of the system control coprocessor, etc.

This last generic makes it straightforward to analyze the cost, in terms of silicon area, that using the watchdog processor adds to this implementation.

It is important to stress that the effect of this generic has nothing to do with modifying the contents of the system register that allows signature verification to be enabled or disabled at run time for each task in the system. The latter is a choice made by the system's own software during program execution and requires the watchdog processor to be present in order to take effect; the generic, in contrast, removes every trace of the watchdog processor from the system, making it impossible to enable or disable it by software.

In short, the VHDL declaration of the system testbench, an entity named Test_SystemOnChip, reads as follows with the generics discussed above:

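ENTITY Test_SystemOnChip IS
    GENERIC (
        filename: STRING  := "./memory.hex";  -- Intel HEX file with the initial memory contents
        debug:    BOOLEAN := FALSE;           -- selects whether the trace system is present
        isis:     BOOLEAN := FALSE );         -- selects whether the watchdog processor is present
END Test_SystemOnChip;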

In this declaration, the values associated with the generics are their defaults. These are the values used when the system is synthesized. When a simulation is run, the value of each parameter can be changed without recompiling the source code; if no specific value is given for a generic on the command line, its default value is used.

For example, the testbench simulation can be invoked from the console of the commercial ModelSim simulator with the following command (the order of the generics on the command line does not matter):

    vsim -Gfilename="./test/programas/memoria.hex" -Gdebug=true -Gisis=true Test_SystemOnChip

where vsim is the command that invokes the simulator, -G indicates that what follows modifies a generic of the entity to be simulated, using the syntax generic=value, and Test_SystemOnChip is the entity to simulate. In this particular case, the values of all the testbench generics have been changed from their defaults.

Figure 3.1: Organization of the test bench

3.2.1. Organization of the test bench

Figure 3.1 shows the internal organization of the test bench and the components that form it. It also shows the influence of each generic on the different components.

Both the address bus and the data bus are 32 bits wide. When accessing data in the system memory, each byte lane can be enabled individually, which allows data smaller than 32 bits to be written.

As the memory interface shows, the memory has been modeled as a standard static RAM. In this type of memory, since there is no clock line to synchronize transfers, the only guarantee that operations are carried out correctly comes from verifying, through the corresponding timing analysis, that the system meets all the timing requirements imposed by the memory: control signal pulse widths, data setup and hold times, etc.

Since the memory model is functional and lacks the level of detail needed to include these timings, in a real system the clock frequency would have to be low enough for read and write operations to complete in one clock cycle.

Obviously, if the memory of the real system were not static RAM, the part of the system corresponding to the external memory interface would have to be modified. Since that interface is connected to a standardized internal bus (called AMBA, see section 3.3), the rest of the system could remain intact.

3.3. System architecture

The designed system (the System-on-a-Chip) has the HORUS processor as its core: a RISC processor with a classic pipeline design to which a watchdog processor has been added to verify the integrity of the executed programs.

In addition to the watchdog processor, and as in every processor of the MIPS architecture, a coprocessor for system control has been incorporated (the System Control Coprocessor, or Coprocessor 0 in MIPS nomenclature). It holds the control registers of the memory management unit (MMU, Memory Management Unit) and those for interrupt and exception handling, among others; in some models of the MIPS architecture, this coprocessor also hosts the registers associated with the debug system, performance counters, cache control, etc. In HORUS, coprocessor 0 gives access to the watchdog processor's registers so that the system operates correctly during exception processing.

An instruction cache with two read ports has been added to this block. This dual access is needed so that the watchdog processor can obtain the reference signatures of the blocks executed by the main processor without disturbing the latter's instruction fetch.

To keep the design of the instruction cache independent of the different types of memory with which the system could be equipped, the connection to memory is made through a standard bus called AMBA. AMBA is a high-performance bus designed by ARM for interconnecting different elements within a chip; it allows several masters (transfer initiators) to be connected through the bus to one or more slaves. The AMBA specification distinguishes two segments: the high-performance segment, AHB (Advanced High-performance Bus), and the lower-speed peripheral segment, APB (Advanced Peripheral Bus). The two segments are connected by an AHB-APB bridge that lets the masters on the AHB segment use the peripherals on the APB segment.

The instruction cache has a controller that, on a miss by either the main processor or the watchdog processor, refills the cache contents by accessing the system memory through the AMBA bus. It is therefore both a cache controller and an AMBA bus master, which allows it to initiate memory transfers. If both processors request a cache refill simultaneously, this controller serializes the accesses, always giving priority to the main processor.

HORUS accesses data memory through a simple AMBA bus master, with no cache involved in this case. A data cache has not been included for two reasons. On the one hand, it is not relevant to demonstrating the feasibility of the work proposed in this thesis, since incorporating a watchdog processor affects neither the data nor their handling. On the other hand, the absence of a data cache increases the traffic with external memory (and therefore over the AMBA bus), which makes it easy to stress the system and better determine the impact of the watchdog processor on performance.

To manage the AMBA bus, a bus arbiter has been designed following ARM's specifications. This arbiter serializes the access of the different masters to the bus, and a second central element, called the slave controller, selects one of the slaves connected to the bus as the target of the memory transfer. This controller determines the slave involved in the transfer from the memory address used by the master, according to a table fixed for each particular system.

In the case at hand, the only AMBA slave designed is the memory controller that provides access to external memory.
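The table-driven selection performed by the slave controller can be pictured with a short Python sketch. The address ranges and slave names below are purely hypothetical (the text does not give the actual memory map of the system), and select_slave is a name introduced only for this illustration.

```python
# Hypothetical address map: (base address, size in bytes, slave name).
# These values are illustrative and are not taken from the HORUS design.
ADDRESS_MAP = [
    (0x0000_0000, 0x1000_0000, "external_memory_controller"),
    (0x8000_0000, 0x0010_0000, "ahb_apb_bridge"),
]


def select_slave(address, address_map=ADDRESS_MAP):
    """Return the name of the slave whose range contains the address, or None."""
    for base, size, name in address_map:
        if address in range(base, base + size):
            return name
    return None  # no slave is mapped at this address
```

Because the decision depends only on the table, adding a slave to a larger system amounts to adding one entry, which is in line with the design goal stated for the arbiter and the slave controller.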

In addition to these elements, an AMBA AHB-APB bridge has also been designed between the high-performance AHB segment and the lower-bandwidth APB segment. In the future, this will allow the system to be equipped with the peripherals typical of an embedded system (communications, timers, etc.), either through in-house designs or by integrating third-party peripherals designed for connection to a standard APB bus.

Both the bus arbiter and the slave controller have been designed with a larger system in mind, one with more than two masters and one slave. This makes it simple and direct, for example, for several HORUS processors to share the AMBA bus and the slaves connected to it, creating a multiprocessor system.


Part II: Publications

A Watchdog Processor Architecture with Minimal Performance Overhead

Francisco Rodríguez, José Carlos Campelo, and Juan José Serrano

Grupo de Sistemas Tolerantes a Fallos - Fault Tolerant Systems Group, Departamento de Informática de Sistemas y Computadoras, Universidad Politécnica de Valencia, 46022-Valencia (Spain),

{prodrig, jcampelo, jserrano}@disca.upv.es, WWW home page: http://www.disca.upv.es/gstf

Abstract

Control flow monitoring using a watchdog processor is a well-known technique to increase the dependability of a microprocessor system. Most approaches embed reference signatures for the watchdog processor into the processor instruction stream, creating noticeable memory and performance overheads. A novel watchdog processor architecture using embedded signatures is presented that minimizes the memory overhead and nullifies the performance penalty on the main processor without sacrificing error detection coverage or latency. This scheme is called Interleaved Signature Instruction Stream (ISIS) in order to reflect the fact that signatures and main instructions are two independent streams that co-exist in the system.

1. Introduction

In the "Model for the Future" foreseen by Avizienis in [1], the urgent need to incorporate dependability into everyday computing is clear: "Yet, it is alarming to observe that the explosive growth of complexity, speed, and performance of single-chip processors has not been paralleled by the inclusion of more on-chip error detection and recovery features".

Efficient error detection is of fundamental importance in dependable computing systems. As the vast majority of faults are transient, the use of a concurrent Error Detection Mechanism (EDM) is of utmost interest, since high coverage and low detection latency are needed to recover the system from the error. And as experiments demonstrate [2, 3, 4, 5], a high percentage of non-overwritten errors result in control flow errors.


Siewiorek states in [6] that "To succeed in the commodity market, fault-tolerant techniques need to be sought which will be transparent to end users". A fault-tolerant technique can be considered transparent only if it results in minimal overhead in silicon, memory size or processor speed.

Although redundant systems can achieve the best degree of fault tolerance, the high overheads implied limit their applicability in everyday computing elements.

The work presented here provides concurrent detection of control flow errors with no performance penalty and minimal memory and silicon overhead. No modifications are needed in the instruction set of the processor used as a testbed, and the architectural ones are so small that they can be enabled and disabled under software control to allow binary compatibility with existing software. The watchdog processor is very simple, and its design can be applied to other processors as well.

The paper is structured as follows: the next section presents a set of basic definitions and is followed by an outline of related work in the literature. Section 4 presents the system architecture in which the watchdog is embedded. Section 5 discusses error detection capabilities, signature characteristics and placement, and the modifications needed to the original architecture of the processor. A memory overhead comparison with similar work is performed afterwards, followed by the conclusions.

2. Basic Definitions

The following definitions are taken from [5]:

1. A branch instruction is an instruction that can break the sequential flow of execution, like a procedure call, a conditional jump or a return-from-procedure instruction.

2. A branch-in point is an instruction used as the destination of a branch instruction or as the entry point of, for example, an interrupt handler.

3. A program is partitioned into branch-free intervals and branch instructions. The beginning of a branch-free interval is a branch-in instruction or the instruction following a branch. A branch-free interval is ended by a branch or a branch-in instruction.

4. A basic block is just a branch-free interval if that interval is ended by a branch-in. Otherwise, it is the branch-free interval together with the branch instruction that follows it.

With the definitions above, a program can be represented by a Control Flow Graph (CFG). Vertices in this graph represent basic blocks and directed arcs represent legal paths between blocks. Figure 1 shows some examples for simple High Level Language constructs.
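The partition described by these definitions can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each instruction is given as a tuple (name, is_branch, is_branch_in), branch delay slots are ignored, and the function name basic_blocks is introduced here for the example.

```python
def basic_blocks(instructions):
    """Split a linear instruction list into basic blocks.

    Each instruction is a tuple (name, is_branch, is_branch_in).
    A block is closed either by a branch, which stays inside the block,
    or by a following branch-in instruction, which opens the next block.
    """
    blocks = []
    current = []
    for name, is_branch, is_branch_in in instructions:
        if is_branch_in and current:
            blocks.append(current)   # interval ended by a branch-in
            current = []
        current.append(name)
        if is_branch:
            blocks.append(current)   # interval plus its closing branch
            current = []
    if current:
        blocks.append(current)
    return blocks


program = [
    ("i0", False, True),    # entry point, hence a branch-in point
    ("i1", False, False),
    ("beq", True, False),   # conditional branch closes the first block
    ("i3", False, False),
    ("i4", False, True),    # branch target: splits the block in two
    ("i5", False, False),
    ("j", True, False),
]
blocks = basic_blocks(program)
```

In this sample, the block holding only i3 is ended by the branch-in instruction i4 rather than by a branch.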

We call block fall-through the situation where two basic blocks are separated with no branch-out instruction in between. The blocks are divided only because the first branch-free interval is ended by a following branch-in instruction that starts the second block.

Figure 1: CFGs for some HLL constructs

In [7], a block that receives more than two transfers of control flow is said to be a branch fan-in block. We distinguish whether the control flow transfer is due to a non-taken conditional branch (that is, both blocks are contiguous in memory) and say that a multiple fan-in block is reachable from more than one out-of-sequence vertex in the CFG.

A branch instruction with more than one out-of-sequence target is represented in the CFG by two or more arcs departing from the same vertex, where at least two of them are targeted to out-of-sequence vertices. These are said to be multiple fan-out blocks.

A derived signature is a value assigned to each instruction block. The term derived means the signature is not an arbitrarily assigned value but is calculated from the block's instructions. Derived signatures are usually obtained by XORing the instruction opcodes or by using such opcodes to feed a Linear Feedback Shift Register (LFSR). These values are calculated at compile time and used as references by the EDM to verify the correctness of the executed instructions.
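Both ways of deriving a signature can be sketched in Python as follows. The feedback polynomial is not stated in the text, so the CRC-16-CCITT polynomial 0x1021 is used here purely as an assumption, as are the 16-bit register width chosen for this sketch, the zero seed and the bit ordering (most significant bit first); the sample instruction words are arbitrary.

```python
POLY = 0x1021  # assumed feedback polynomial; the text does not name one


def lfsr_signature(words, seed=0):
    """Compact 32-bit instruction words into a 16-bit serial LFSR signature."""
    state = seed
    for word in words:
        for bit in range(31, -1, -1):          # feed the word MSB first
            incoming = (word >> bit) & 1
            feedback = ((state >> 15) & 1) ^ incoming
            state = (state << 1) & 0xFFFF
            if feedback:
                state ^= POLY
    return state


def xor_signature(words):
    """XOR all instruction words of a block together."""
    signature = 0
    for word in words:
        signature ^= word
    return signature


block = [0x8C880000, 0x01094020, 0x1500FFFD]     # arbitrary sample words
corrupted = [block[0] ^ 0x00010000] + block[1:]  # single-bit corruption
reference = lfsr_signature(block)
runtime = lfsr_signature(corrupted)
```

With a zero seed the LFSR is linear, so a single flipped bit always changes the signature. The XOR signature also catches a single flipped bit, but it is insensitive to the order of the instructions within the block.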

If signatures are interspersed or hashed with the processor instructions, the method is generally known as Embedded Signature Monitoring (ESM). A watchdog processor is a hardware EDM used to detect Control Flow Errors (CFE) and/or corruption of the instructions executed by the processor, usually employing derived signatures and an ESM technique. In this case, it performs signature calculations from the instruction opcodes that are actually executed by the main processor, checking these run-time values against their references. If any difference is found, an error in the main processor instruction stream is detected and an Error Recovery Mechanism (ERM) is activated.

3. Related work

Several hardware approaches using a watchdog processor and derived signatures for concurrent error detection have been proposed. The most relevant works are outlined below:

Ohlsson et al. present in [5] a watchdog processor built into a RISC processor. A specialized tst instruction is inserted in the delay slot of every branch instruction, testing the signature of the preceding block. An instruction counter is also used to time out an instruction sequence when a branch instruction is not executed in the specified range. Other watchdog-supporting instructions are added to the processor instruction set to save and restore the value of the instruction counter on procedure calls.

The watchdog processor used by Galla et al. in [8] to verify correct execution of a communications controller of the Time-Triggered Architecture uses a similar approach. A check instruction is inserted in appropriate places to trigger the checking process with the reference signature that is stored in the subsequent word. In the case of a branch, the branch delay slot is used to insert an adjustment value for the signature to ensure the run-time signature is the same at the check instruction regardless of the path followed. An instruction counter is also used by the watchdog. The counter is loaded during the check instruction and decremented for every instruction executed; a time-out is issued if the counter reaches zero before a new check instruction is executed. Due to the nature of the communications architecture, no interrupts are processed by the controller. Thus, saving the run-time signature or instruction counter is not necessary.

The ERC32 is a SPARC processor augmented with parity bits and a program flow control mechanism, presented by Gaisler in [9]. In the ERC32, a test instruction to verify the processor control flow is also inserted in the delay slot of every branch to verify the instruction bits of the preceding block. In his work, the test instruction is a slightly modified version of the original nop instruction, and no other modifications to the instruction set are needed.

A different error detection mechanism is presented by Kim and Somani in [10]. The decoded signals for the pipeline control are checked on a per-instruction basis, and their references are retrieved from a watchdog private cache. If the run-time signature of a given instruction can't be checked because its reference counterpart is not found in the cache, it is stored in the cache and used as the reference for future executions. No signatures or program modifications are needed because reference signatures are generated at run time, thus creating no overhead. The drawback of this approach is that the watchdog processor can't check all instructions. An instruction can be checked only if it has been previously executed and its reference has not been displaced from the watchdog private cache to store other signatures. Although the error is detected before the instruction is committed and no overheads are created, the error coverage is poor.

More recently, hardware additions to modern processor architectures have been proposed to re-execute instructions and perform a comparison to verify that no errors have been produced before instructions are committed.


Some of these proposals are outlined below for the sake of completeness, but they are out of the scope of this work because: i) Hardware additions, spare components and/or time redundancy are used to detect all possible errors by re-execution of all instructions. Not only errors in the instruction bits or execution flow are detected, but data errors as well. ii) They require either a complete redesign of the processor control unit or the addition of a complete execution unit capable of carrying out the same set of instructions as the main processor, although its control unit can be simpler.

These include, to name a few:

REESE (Nickel and Somani, [11]) and AR-SMT (Rotenberg, [12]). Both works take advantage of the simultaneous multi-threading architecture to execute every instruction twice. The instructions of the first thread, along with their operands and results, are stored in a queue (a delay buffer in Rotenberg's work) and re-executed. Results of both executions are compared before the instructions are committed.

The microprocessor design approach of Weaver and Austin in [13] to achieve fault tolerance is the substitution of the commitment stage of a pipelined processor with a checker processor. Instructions, along with their inputs, addresses and the results obtained, are passed to the checker processor, where instructions are re-executed and results can be verified before they are committed.

The O3RS design of Mendelson and Suri in [14] and a modified multiscalar architecture used by Rashid et al. in [15] use spare components in a processor capable of issuing more than one instruction per cycle to re-execute instructions.

4. System Architecture

The system (see Fig. 2) is built around a soft core of a MIPS R3000 processor clone developed in synthesizable VHDL [16]. It is a 5-stage pipelined RISC processor running the MIPS-I and MIPS-II Instruction Set Architecture [17]. The instruction and data bus interfaces are designed as AMBA [18] AHB bus masters providing external memory access.

This processor is provided with a Memory Management Unit (MMU) inside the System Control Coprocessor (CP0 in the MIPS nomenclature) to perform virtual to physical address mapping, isolating memory areas of different processes and checking correct alignment of memory references.

To minimise the performance penalty, the instruction cache is designed with two read ports that can provide two instructions simultaneously, one for each processor. On a cache hit, no interference exists even if the other processor is waiting for a cache line refill because of a cache miss.

To reduce the instruction cache complexity, a single write port is provided that must be shared by both processors. When simultaneous cache misses happen, cache refills are served in a First-Come First-Served fashion. If they happen in the same clock cycle, the main processor is given priority.

Figure 2: System architecture

This arrangement takes advantage of spatial locality in the application program to increase cache hits for signatures. As we use an ESM technique and signatures are interleaved with processor instructions, when both processors produce a cache miss they request the same memory block most of the time, as both reference words in the same program area.

No modification is needed in the processor instruction set because signature instructions are neither fetched nor executed by the main processor. This allows us to maintain binary compatibility with existing software. If access to the source code is not possible, the program can be run without modification (and no concurrent flow error detection capability will be provided). This is possible because the watchdog processor and the processor's modified architecture can be enabled and disabled under software control by code running with superuser privileges. If these features are disabled, our processor behaves as an off-the-shelf MIPS processor. Thus, if binary compatibility is needed for a given task, these features must be disabled by the OS every time the task resumes execution.

The watchdog processor is fed with the instructions from the main processor pipeline as they are retired. When these instructions enter the watchdog, the run-time signatures and address parity bits are calculated at the same rate as the instructions arrive. When a block ends, these values are stored in a FIFO memory to decouple the signature checking process. This FIFO allows a large set of instructions to be retired from the pipeline while the watchdog is waiting for a cache refill in order to get a reference signature instruction. In a similar way, the FIFO can be emptied by the watchdog while the main processor pipeline is stalled due to a memory operation. When the FIFO is full, the pipeline is forced to wait for the watchdog checking process to read some data from it.


Figure 3: Block signature encoding

5. Interleaved Signature Instruction Stream

Block signatures are placed at the beginning of every basic block in our scheme. These reference signatures are used by the watchdog processor only and are not processed in any way by the main processor.

Two completely independent, interleaved instruction streams coexist in our system: the application instruction stream, which is divided into blocks and executed by the main processor, and the signature stream used by the watchdog processor. For this reason, we call our technique Interleaved Signature Instruction Stream (ISIS).

The signature word (see Fig. 3 for a field description) provides enough information to the watchdog processor to check the following block properties:

1. Block length. The watchdog processor checks a block's signature when the last instruction of the block is retired from the processor pipeline. Instead of relying on the branch instruction at the end of the block to perform the signature checking, the watchdog counts the instructions as they are retired. In this way, the watchdog can anticipate when the last instruction comes and detect a CFE if a branch occurs too early or too late.

2. Block signature. The block instructions are compacted using a 16-bit LFSR that will be used by the watchdog to verify that the correct instructions have been retired from the processor pipeline.

3. Block Target Address. In the case of a non-multiple fan-out block with a target address that can be determined at compile time, a 3-bit parity signature is computed from the address difference between the branch and the out-of-sequence target instruction. These parity bits are used at run time to provide some confidence that the instruction reached after the branch is the correct one.

4. Block Origin Address. When the branch of a multiple fan-out block is executed, the watchdog can't check all possible destinations even if they are obtainable at compile time. In our scheme, every possible destination block is provided with a 3-bit parity signature of the address difference between the originating branch and the start of the block, much the same as in the previous Block Target Address check. Thus, instead of checking that the target instruction is the correct one, the watchdog processor checks (at the target block) that the originating branch is the correct one in this case.


Figure 4: Example of an address checking uncovered case

The signature instruction encoding has been designed in such a way that a main processor instruction cannot be misinterpreted as a watchdog signature instruction. This provides an additional check when a branch instruction is executed by the main processor. This check consists of the requirement to find a signature instruction immediately preceding the first instruction of every block. This also helps to detect a CFE if a branch erroneously reaches a signature instruction, because the encoding used will force an illegal instruction exception to be raised.

Furthermore, the block type helps the watchdog processor to check whether the execution flow is correct. For example, in the case of a multiple fan-out block, the block type reflects the need to check the address signature at the target block. Even if an incorrect branch is taken to the initial instruction of a block, the target's signature instruction must have coded in its type that it is a block where the origin address must be checked, or a CFE exception will be raised.

Instructions in the MIPS processor must be placed at word boundaries; a memory alignment exception is raised if this requirement is not met. Taking advantage of this mechanism, the watchdog processor computes address differences as 30-bit values. Given that the branch instruction type used most of the time by the compiler uses a 16-bit offset to reach the target instruction, the differences obtained at run time for the Block Target Address and Block Origin Address checks are usually half empty, so every parity bit protects 5 (10 in the worst case) of such bits.
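A minimal Python sketch of such a 3-bit parity over a 30-bit word difference is given below. The text does not say how the 30 difference bits are distributed among the three parity bits; assigning bit i to parity bit i mod 3 is an assumption made here, as are the function name address_parity and the sample addresses.

```python
def address_parity(branch_addr, target_addr):
    """Return a 3-bit parity of the 30-bit word difference between two addresses.

    Both addresses are assumed to be word aligned, so the two low-order
    bits of the byte difference carry no information and are dropped.
    """
    diff = ((target_addr - branch_addr) >> 2) & 0x3FFF_FFFF
    parity = 0
    for i in range(30):
        if (diff >> i) & 1:
            parity ^= 1 << (i % 3)   # assumed interleaved bit assignment
    return parity


reference = address_parity(0x0000_1000, 0x0000_1020)  # computed at compile time
observed = address_parity(0x0000_1000, 0x0000_1024)   # wrong target at run time
```

Under this assignment, each parity bit covers 10 of the 30 difference bits, in line with the worst-case figure quoted above. Being a parity code, it gives some confidence rather than certainty: different targets can share the same 3-bit value.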

To our knowledge, Block Origin Address checking has never been proposed in the literature. The solutions offered so far to manage jumps with multiple targets use justifying signatures (see [7] for an example) to patch the run-time signature and delay the check until a common branch-in point is encountered, increasing the error detection latency.

Not all jumps can be covered with address checking, however: neither jumps with run-time computed addresses nor jumps to a multiple fan-in block that is shared by several multiple fan-out blocks (see Fig. 4 for an example). In the latter case, an address signature per origin would be needed in the fan-in block, which is not possible. Currently, only Block Origin Address checks from non multiple fan-out blocks can be covered for such shared blocks.

Figure 5: An if-then-else example (a). After block signatures and jump insertion (b)

5.1. Processor Architecture Modifications

Isolating the reference signatures from the instructions fed into the processor pipeline results in a minimal performance overhead in the application program. Slight architecture modifications are needed in the main processor in order to achieve it.

First of all, when a conditional branch instruction ends a basic block, a second block follows immediately. The second block's signature sits between them, and the main processor must skip it. To jump over the signature, the signature size is added to the Program Counter if the branch is not taken.

In the same way, when a procedure call instruction ends a basic block, the next block to be executed after the procedure returns immediately follows the first one. Again, the second block's signature must be taken into account when calculating the procedure return address, and again this is achieved by automatically adding the signature size to the PC.

The additions to the PC mentioned above can be generated automatically at run-time because the control unit decodes a branch or procedure call instruction at the end of the block. The instruction is a clear indication that the block end will arrive soon. As the processor has a pipelined architecture, the next instruction is executed in all cases (this is known as the branch delay slot), so the control unit has a clock cycle to prepare for the addition. Although the instruction in the delay slot is placed after the branch, it logically belongs to the same block, as it is executed even if the branch is taken.

However, in the case of a block fall-through the control unit has no clue to determine when the first block ends, so the signature cannot be jumped over automatically. In this case, the compiler explicitly adds an unconditional jump to skip it. This is the only case where a processor instruction must be added in order to isolate the main processor from the signature stream. Figure 5a shows an example of an if-then-else construct with a fall-through block that needs such an addition (shown in Fig. 5b).
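The Program Counter adjustments described in this section can be sketched in a few lines of Python. This is a behavioural illustration, not the control unit's actual logic; the function names are hypothetical, and the signature is assumed to occupy the single word placed right after the delay slot, as the text describes.

```python
WORD = 4               # MIPS instructions are one 32-bit word
SIGNATURE_SIZE = WORD  # ISIS inserts a single word per basic block


def sequential_pc(branch_addr, isis_enabled=True):
    """Address where sequential execution resumes after a branch or call.

    Without ISIS this is the usual MIPS value (branch address + 8, past the
    delay slot). With ISIS the next block's signature is skipped as well.
    The same value serves as the return address of a procedure call.
    """
    pc = branch_addr + 2 * WORD
    if isis_enabled:
        pc += SIGNATURE_SIZE
    return pc


def next_pc(branch_addr, taken, target_addr, isis_enabled=True):
    """Next PC after the delay slot of a conditional branch."""
    # A taken branch already points at the first instruction of its target
    # block, so only the not-taken path needs the extra addition.
    return target_addr if taken else sequential_pc(branch_addr, isis_enabled)
```

For a branch at address 0x100, the not-taken path resumes at 0x10C with ISIS enabled and at 0x108 with it disabled, which is what allows unmodified binaries to run when the feature is switched off.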


6. Overhead analysis

Although we do not yet have enough experimental data to assess the memory and performance overhead of our system, a qualitative analysis of the memory overhead based on related work is possible.

A purely software approach to concurrent error detection was evaluated by Wildner in [19]. This control flow EDM is called Compiler-Assisted Self Checking of Structural Integrity (CASC) and it is based on address hashing and signature justifying to protect the return address of every procedure. At the procedure entry, the return address of the procedure is extracted from the link register into a general-purpose register to be operated on. The first operation is the inversion of the LSB of the return address, so that a misalignment exception is raised in the case of a CFE. An add instruction is inserted at each basic block to justify the procedure signature and, at the exit point, the final justifying and re-inversion of the LSB are calculated and the result is transferred to the link register before returning from the procedure. In the case of a CFE, the return address obtained is likely to cause a misalignment exception, thus catching the error. The experiments carried out on a RISC SPARC processor resulted in a code size overhead for the SPECint92 benchmarks varying from 0% to 28% (18.76% on average) depending on the run-time library used and the benchmark itself.

The hardware watchdog of Ohlsson et al. presented in [5] uses a tst instruction per basic block, taking advantage of the branch delay slot of a pipelined RISC processor called TRIP. One of the detection mechanisms used by the watchdog is an instruction counter that issues a time-out exception if a branch instruction is not executed during the specified interval. When a procedure is called, two instructions are inserted to save the block instruction counter and another instruction is inserted at the procedure end to restore it. Their watchdog code size overhead is evaluated to be between 13% and 25%. The latter value comes from the heap sort algorithm, which shows a mean basic block of 4.8 instructions.

ISIS inserts a single word per basic block, without special treatment for procedure entry and exit blocks, so the CASC or TRIP overhead can be taken as an upper bound of the ISIS memory overhead.

Hennessy and Patterson in [20] state that the average length of a basic block for a RISC processor sits between 7 and 8 instructions. The estimate of memory overhead as 1/L, where L is the basic block length, is used by Ohlsson and Rimén in [21] to evaluate the memory overhead of their Implicit Signature Checking (ISC) method. The same value (7-8 instructions per block) is used by Shirvani and McCluskey in [22] to perform this same analysis on several software signature checking techniques.

Applying this evaluation method to ISIS results in a mean memory overhead of about 12% to 15%. An additional word must be accounted for each jump inserted to eliminate a fall-through block. The overhead of these insertions has yet to be studied methodically, but initial experiments show a negligible impact on overall memory overhead.
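The 1/L estimate can be reproduced with a few lines of Python. The function name is hypothetical, and the calculation covers only the one signature word per block; the extra jumps for fall-through blocks are left out, as in the estimate above.

```python
def memory_overhead(block_length, words_per_block=1):
    """Relative memory overhead of adding words_per_block words to every
    basic block of block_length instructions (the 1/L estimate)."""
    if block_length <= 0:
        raise ValueError("block length must be positive")
    return words_per_block / block_length


for length in (7, 8):
    print(f"L = {length}: {memory_overhead(length):.1%}")
# L = 7: 14.3%
# L = 8: 12.5%
```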

7. Conclusion

We have presented a novel technique to embed signatures into the execution flow of a RISC processor that provides a set of error checking procedures to verify that the flow of executed instructions is correct. These checking procedures include a block length count, the signature of instruction opcodes using an LFSR, and address checking when a branch is executed. All these checks are performed on a per-block basis, in order to reduce the error detection latency of our hardware Error Detection Mechanism.

One of those address checking procedures has not been published before. It is the Block Origin Address check, used when a branch has multiple valid targets. It consists of delaying the branch check until the target instruction is reached and verifying that the branch comes from the correct origin vertex in the CFG. This technique solves the address checking problem that arises if a branch has multiple valid destinations, for example the table-based jumps used when the OS dispatches a service request.

Not all software cases can be covered with address checking, however. When a CFG vertex is targeted from two or more multiple fan-out vertices, the Block Origin Address check becomes ineffective.

We have named our signature embedding technique Interleaved Signature Instruction Stream (ISIS) to reflect the important fact that the signature instructions processed by the watchdog processor and the main processor instructions are two completely independent streams.

ISIS has been implemented in a RISC processor and the modifications to the original architecture demanded by signature embedding have been discussed. These modifications are very simple and can be enabled and disabled by software with superuser privileges to maintain binary compatibility with existing software. No specific features of the processor have been used, so porting ISIS to a different processor architecture is quite straightforward.

Memory overhead has been studied by comparison with other methods, and the analysis shows a memory overhead between 12% and 15%, although we have not performed a methodical study yet. As a negligible number of instructions are added to the original program, the performance is expected to remain basically unaltered.

Acknowledgements

This work is supported by the Spanish Government Comisión Interministerial de Ciencia y Tecnología under project CICYT TAP99-0443-C05-02.


Bibliography

[1] Avizienis, A.: Building Dependable Systems: How to Keep Up with Complexity. Proc. of the 25th Fault Tolerant Computing Symposium (FTCS-25), 4-14, Pasadena, California, 1995.

[2] Gunneflo, U., Karlsson, J., Torin, J.: Evaluation of Error Detection Schemes Using Fault Injection by Heavy-ion Radiation. Proc. of the 19th Fault Tolerant Computing Symposium (FTCS-19), 340-347, Chicago, Illinois, 1989.

[3] Czeck, E.W., Siewiorek, D.P.: Effects of Transient Gate-Level Faults on Program Behavior. Proc. of the 20th Fault Tolerant Computing Symposium (FTCS-20), 236-243, Newcastle upon Tyne, U.K., 1990.

[4] Gaisler, J.: Evaluation of a 32-bit Microprocessor with Built-in Concurrent Error Detection. Proc. of the 27th Fault Tolerant Computing Symposium (FTCS-27), 42-46, Seattle, Washington, 1997.

[5] Ohlsson, J., Rimén, M., Gunneflo, U.: A Study of the Effects of Transient Fault Injection into a 32-bit RISC with Built-in Watchdog. Proc. of the 22nd Fault Tolerant Computing Symposium (FTCS-22), 316-325, Boston, Massachusetts, 1992.

[6] Siewiorek, D.P.: Niche Successes to Ubiquitous Invisibility: Fault-Tolerant Computing Past, Present, and Future. Proc. of the 25th Fault Tolerant Computing Symposium (FTCS-25), 26-33, Pasadena, California, 1995.

[7] Oh, N., Shirvani, P.P., McCluskey, E.J.: Control Flow Checking by Software Signatures. IEEE Transactions on Reliability - Special Section on Fault Tolerant VLSI Systems, March, 2001.

[8] Galla, T.M., Sprachmann, M., Steininger, A., Temple, C.: Control Flow Monitoring for a Time-Triggered Communication Controller. Proceedings of the 10th European Workshop on Dependable Computing (EWDC-10), 43-48, Vienna, Austria, 1999.

[9] Gaisler, J.: Concurrent Error-Detection and Modular Fault-Tolerance in a 32-bit Processing Core for Embedded Space Flight Applications. Proc. of the 24th Fault Tolerant Computing Symposium (FTCS-24), 128-130, Austin, Texas, 1994.

[10] Kim, S., Somani, A.K.: On-Line Integrity Monitoring of Microprocessor Control Logic. Proc. Intl. Conference on Computer Design: VLSI in Computers and Processors (ICCD-01), 314-319, Austin, Texas, 2001.

[11] Nickel, J.B., Somani, A.K.: REESE: A Method of Soft Error Detection in Microprocessors. Proc. of the 2001 Intl. Conference on Dependable Systems and Networks (DSN-2001), 401-410, Goteborg, Sweden, 2001.

[12] Rotenberg, E.: AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors. Proc. of the 29th Fault Tolerant Computing Symposium (FTCS-29), 84-91, Madison, Wisconsin, 1999.


[13] Weaver, C., Austin, T.: A Fault Tolerant Approach to Microprocessor Design. Proc. of the 2001 Intl. Conference on Dependable Systems and Networks (DSN-2001), 411-420, Goteborg, Sweden, 2001.

[14] Mendelson, A., Suri, N.: Designing High-Performance & Reliable Superscalar Architectures. The Out of Order Reliable Superscalar (O3RS) Approach. Proc. of the 2000 Intl. Conference on Dependable Systems and Networks (DSN-2000), 473-481, New York, USA, 2000.

[15] Rashid, F., Saluja, K.K., Ramanathan, P.: Fault Tolerance Through Re-execution in Multiscalar Architecture. Proc. of the 2000 Intl. Conference on Dependable Systems and Networks (DSN-2000), 482-491, New York, USA, 2000.

[16] IEEE Std. 1076-1993: VHDL Language Reference Manual. The Institute of Electrical and Electronics Engineers Inc., New York, 1995.

[17] MIPS32 Architecture for Programmers, volume I: Introduction to the MIPS32 Architecture. MIPS Technologies, 2001.

[18] AMBA Specification rev2.0. ARM Limited, 1999.

[19] Wildner, U.: Experimental Evaluation of Assigned Signature Checking With Return Address Hashing on Different Platforms. Proc. of the 6th Intl. Working Conference on Dependable Computing for Critical Applications, 1-16, Grainau, Germany, 1997.

[20] Hennessy, J.L., Patterson, D.A.: Computer Architecture. A Quantitative Approach, 2nd edition, Morgan Kaufmann Pub., Inc., 1996.

[21] Ohlsson, J., Rimén, M.: Implicit Signature Checking. Proc. of the 25th Fault Tolerant Computing Symposium (FTCS-25), 218-227, Pasadena, California, 1995.

[22] Shirvani, P.P., McCluskey, E.J.: Fault-Tolerant Systems in a Space Environment: The CRC ARGOS Project. Center for Reliable Computing, Technical Report CRC-98-2, Stanford, California, 1998.


The HORUS Processor

F. Rodríguez, J. C. Campelo and J. J. Serrano

Grupo de Sistemas Tolerantes a Fallos - Fault Tolerant Systems Group
Dept. Informática de Sistemas y Computadores
Universidad Politécnica de Valencia, 46022 Valencia (Spain)
e-mail: {prodrig, jcampelo, jserrano}@disca.upv.es

http://www.disca.upv.es/gstf

Abstract

Designing a complete SoC or reusing SoC components to create a complete system is a common task nowadays. The flexibility offered by the design flow gives the designer an unprecedented capability to incorporate increasingly demanded features, such as error detection and correction mechanisms, to increase the system dependability. This paper describes the design of the HORUS processor, a RISC processor augmented with a concurrent error detection mechanism, and the architectural modifications needed on the original design. Taking advantage of modern high-level design methodology and using the VHDL modeling language, the standard architecture has been slightly modified to minimize the resulting performance penalty.

1. Introduction

With the advent of modern technologies in the field of programmable devices and enormous advances in the software tools used to model, simulate and translate into real hardware almost any complex digital system, the capability to design a whole System-On-Chip (SoC) has become a reality even for small companies. With the widespread use of embedded systems in our everyday life, service availability and dependability concerns for these systems are increasingly important [1].

A SoC is usually modelled using a Hardware Description Language (HDL) like VHDL [2]. It allows a hierarchical description of the system, and the designed elements interconnect much the same way as they would in a graphical design flow, but using an arbitrary abstraction level. It also provides I/O facilities to easily incorporate test vectors, and language assertions to verify the correct behaviour of the model during simulation.

Efficient error detection is of fundamental importance in dependable computing systems. As the vast majority of faults are transient, the use of a concurrent Error Detection Mechanism (EDM) is of utmost interest, since high coverage and low detection latency are needed to recover the system from the error. And as experiments demonstrate [3, 4, 5, 6], a high percentage of non-overwritten errors result in control flow errors.

The possibility to modify the original architecture of a processor modelled using VHDL gives the SoC designer an unprecedented capability to incorporate EDMs which were previously available only to large design companies.

Siewiorek states in [7] that "To succeed in the commodity market, fault-tolerant techniques need to be sought which will be transparent to end users". A fault-tolerant technique can be considered transparent only if it results in minimal overhead in silicon, memory size or processor speed. Although redundant systems can achieve the best degree of fault-tolerance, the high overheads imposed limit their applicability in everyday computing elements. The same limitation applies when a software-only solution is used, due to the processor performance penalty incurred.

Siewiorek's statement can also be translated into the SoC world, to demand fault-tolerant techniques that minimise their impact on performance (the scarcest resource in such systems) if we want those techniques to be used at all.

The work presented here describes the design of the HORUS processor. It is a classic pipelined RISC processor designed in VHDL that has been augmented with a concurrent mechanism for detecting control flow errors, with no performance penalty and minimal memory and silicon overheads. No modifications are needed in the instruction set of the processor used, and the architectural ones are so small that they can be enabled and disabled under software control to allow binary compatibility with existing software. The watchdog processor is very simple, and its design can be applied to other RISC processors as well.

The paper is structured as follows. The next section presents a set of basic definitions and is followed by an outline of related work in the literature. Section 4 presents the signature embedding technique the system uses. Section 5 discusses the processor architecture and its compiler support. A memory overhead comparison with similar work is performed afterwards, followed by some preliminary synthesis results and the conclusions.

2. Basic terms

A computer program can be represented by a Control Flow Graph (CFG). Vertices in this graph are used to represent basic blocks and directed arcs are used to represent legal paths between blocks.


A basic block is a sequence of instructions to be executed in order, with no branch targets except for the very first instruction and with no branch instructions except possibly the last one (if any).

A derived signature is a value assigned to each instruction block. The term derived means the signature is not an arbitrarily assigned value but is calculated from the block's instructions. Derived signatures are usually obtained by XORing the instruction opcodes or by using such opcodes to feed a Linear Feedback Shift Register (LFSR). These values are calculated at compile time and used as a reference by the EDM to verify the correctness of the executed instructions.
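The two ways of deriving a signature mentioned above can be sketched as follows. This is a minimal illustration, not the signature logic of any particular watchdog: the 16-bit width and the polynomial 0x1021 are assumptions chosen for the example, and the function names are hypothetical.

```python
def xor_signature(words):
    """Derived signature obtained by XORing all instruction words of a block."""
    signature = 0
    for word in words:
        signature ^= word
    return signature


def lfsr_signature(words, poly=0x1021, width=16):
    """Derived signature obtained by shifting each 32-bit instruction word,
    most significant bit first, through an LFSR (serial CRC-style division).

    Assumption: 16-bit register with feedback polynomial 0x1021.
    """
    mask = (1 << width) - 1
    state = 0
    for word in words:
        for i in range(31, -1, -1):
            feedback = ((state >> (width - 1)) & 1) ^ ((word >> i) & 1)
            state = (state << 1) & mask
            if feedback:
                state ^= poly
    return state
```

The compiler would compute such a value over the instructions of each block, and the watchdog would recompute it over the instructions actually executed. Unlike the plain XOR, the LFSR variant depends on instruction order, so two swapped instructions generally change the signature.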

If signatures are interspersed or hashed with the processor instructions, the method is generally known as Embedded Signature Monitoring (ESM). A watchdog processor is a hardware EDM device used to detect Control Flow Errors (CFE) and/or corruption of the instructions executed by the processor, usually employing derived signatures and an ESM technique. In this case it performs signature calculations from the instruction opcodes that are actually executed by the main processor, checking these run-time values against their references. If any difference is found, an error in the main processor instruction stream has been detected and an Error Recovery Mechanism (ERM) is activated.

3. Related work

Several approaches using an ESM watchdog processor and derived signatures for concurrent error detection have been proposed in the literature. The most relevant works are outlined below. Other recent proposals in the concurrent error detection field are based on features of latest-generation processors, such as support for simultaneous multithreading or superscalar architectures. These approaches are valid but of very limited use in SoC designs, due to the simpler architectures of the processors used there.

Ohlsson et al. present in [6] a watchdog processor built into a RISC processor. A specialized tst instruction is inserted in the delay slot of every branch instruction, testing the signature of the preceding block. An instruction counter is also used to time out the instruction sequence when a branch instruction is not executed in the specified range. Other watchdog supporting instructions are added to the processor instruction set to save and restore the value of the instruction counter on procedure calls.

The watchdog processor used by Galla et al. in [8] to verify correct execution of a communications controller of the Time-Triggered Architecture uses a similar approach. A check instruction is inserted in appropriate places to trigger the checking process with the reference signature that is stored in the subsequent word. In the case of a branch, the branch delay slot is used to insert an adjustment value for the signature, to ensure the run-time signature is the same at the check instruction independent of the path followed. The watchdog also uses an instruction counter. This is loaded during the check instruction and decremented for every instruction executed; a time-out is issued if the counter reaches zero before a new check instruction is executed. Due to the nature of the communications architecture, no interrupts are processed by the controller, so saving the run-time signature or instruction counter is not necessary.
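The instruction counter time-out described above can be modelled in a few lines. This is a behavioural sketch only: the class and method names are hypothetical, and it assumes the check instruction itself is not counted against the budget it loads.

```python
class InstructionCounterWatchdog:
    """Time-out watchdog: a check instruction loads the counter and every
    executed instruction decrements it; reaching zero signals a time-out."""

    def __init__(self):
        self.counter = None  # not armed until the first check instruction

    def on_check(self, budget):
        """A check instruction reloads the counter with its budget."""
        self.counter = budget

    def on_instruction(self):
        """Account for one executed instruction; return True on time-out."""
        if self.counter is None:
            return False
        self.counter -= 1
        return self.counter <= 0
```

A control flow error that makes execution skip the next check instruction would let the counter run down to zero, which is how this mechanism bounds the detection latency.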

The ERC32 is a SPARC processor augmented with parity bits and a program flow control mechanism presented by Gaisler in [9]. In the ERC32 a test instruction is also inserted in the delay slot of every branch to verify the instruction bits of the preceding block. In his work, the test instruction is a slightly modified version of the original nop instruction and no other modifications to the instruction set are needed.

4. ISIS: Interleaved Signatures Instruction Stream

All ESM watchdogs presented in the preceding section require processor cycles to check instruction signatures, so a performance penalty is inevitable. As basic blocks in a RISC processor are between 4 and 10 instructions long, the memory and performance overhead is quite noticeable.

To reduce the performance overhead, performance being the scarcest resource when targeting field-programmable devices, the main CPU should not process signatures in any way. With this objective in mind, we have designed a CPU that skips an instruction per basic block while maintaining CPU instruction sequencing.

Those instruction gaps are filled with block signatures and processed by the watchdog processor. With this arrangement, two completely independent interleaved instruction streams coexist in our system: the application instruction stream, which is divided into blocks and executed by the main processor, and the signature stream, used by the watchdog processor. For this reason we have named our technique Interleaved Signature Instruction Stream (ISIS). More information about the error detection mechanisms included with the block signature can be found in [10].

Isolating the reference signatures from the instructions fed into the processor pipeline results in a minimal performance overhead in the application program. To achieve this isolation, several architecture modifications are needed in the main processor. These are summarized below:

a) When a conditional branch instruction ends a basic block, a second block follows immediately. The second block's signature sits between them, and the main processor must skip it. To jump over the signature, the signature size is added to the Program Counter if the branch is not taken.

b) When a procedure call is executed, the block to be executed after returning from the procedure immediately follows the first one. Again, the second block's signature must be taken into account when calculating the procedure return address, and again this is achieved by automatically adding the signature size to the PC when the procedure is called.

c) In the preceding cases, the processor can perform the PC additions when it executes a branch or call instruction. However, when a fall-through block is executed, the processor has no way to determine when the last instruction of the block arrives. A fall-through block is a basic block that does not end with a branch instruction; that is, it ends because the instruction that follows the block is the target of a branch. To help the CPU, the compiler inserts a jump instruction to signal the end of the block. This is the only case where a processor instruction must be added in order to isolate the main processor from the signature stream. Figure 1 shows an example of an if-then-else construct with such a block and Fig. 2 the same construct after the jump instruction and the signatures have been added.

5. The HORUS processor

The system (see Fig. 3) is built around the HORUS processor, a soft-core clone of the MIPS R3000 processor developed in synthesizable VHDL. It is a four-stage pipelined RISC processor running the MIPS-I and MIPS-II Instruction Set Architecture [11], except that it does not provide floating point support. It has been augmented to include a watchdog processor implementing the ISIS technique. Instruction and data bus interfaces are designed as AMBA [12] AHB bus masters providing memory access for the main and watchdog processors.

Figure 1: If-then-else construct with a fall-through block

Figure 2: If-then-else construct after signatures and jump instruction inserted

Figure 3: HORUS processor and overall system architecture

This processor is provided with a Memory Management Unit (MMU) to perform virtual to physical address mapping, isolating memory areas of different processes and checking correct alignment of memory references.

To minimize the performance penalty, the instruction cache is designed with two read ports that can provide two instructions simultaneously, one for each processor. On a cache hit, no interference exists even if the other processor is waiting for a cache line refill because of a cache miss.

To reduce the instruction cache complexity, a single write port is provided that must be shared by both processors. When simultaneous cache misses happen, cache refills are served in a First-Come First-Served fashion. If they happen in the same clock cycle, the main processor takes priority.
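The refill arbitration policy just described can be expressed as a simple ordering rule. The sketch below is only a software model of that policy with hypothetical names; it says nothing about how the cache controller is actually implemented.

```python
MAIN, WATCHDOG = "main", "watchdog"


def refill_order(requests):
    """Order cache refill requests, given as (cycle, requester) pairs.

    First-Come First-Served by clock cycle; on a tie in the same cycle
    the main processor is served before the watchdog processor.
    """
    priority = {MAIN: 0, WATCHDOG: 1}
    return sorted(requests, key=lambda req: (req[0], priority[req[1]]))


print(refill_order([(5, WATCHDOG), (5, MAIN), (3, WATCHDOG)]))
# [(3, 'watchdog'), (5, 'main'), (5, 'watchdog')]
```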

This arrangement takes advantage of spatial locality in the application program to increase cache hits for signatures. As we use an ESM technique and signatures are interleaved with processor instructions, when both processors produce a cache miss they request the same memory block most of the time, as both reference words in the same program area within very few cycles of each other.

No modification is needed in the processor instruction set due to the fact that signature instructions are neither fetched nor executed by the main processor. This allows us to maintain binary compatibility with existing software. If access to the source code is not possible, the program can be run without modification (and no concurrent error detection capability will be provided).

This is possible because the watchdog processor and the processor's modified architecture can be enabled and disabled under software control running with superuser privileges. If these features are disabled, our processor behaves as an off-the-shelf MIPS processor. Thus, if binary compatibility is needed for a given task, the OS must disable these features every time the task resumes execution.

The watchdog processor is fed with the instructions from the main processor pipeline as they are retired. As these instructions enter the watchdog, the run-time signatures are calculated at the same rate as the instructions arrive. When a block ends, these values are stored in a FIFO memory to decouple the signature checking process.

This FIFO allows a large set of instructions to be retired from the pipeline while the watchdog is waiting for a cache refill in order to get a reference signature instruction. In a similar way, the watchdog can empty the FIFO while the main processor pipeline is stalled due to a memory operation. When this memory is full, the pipeline is forced to wait for the watchdog checking process to read some data from the FIFO.
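The decoupling role of the FIFO can be illustrated with a small behavioural model. The class name, its methods and the depth value are hypothetical; the real FIFO depth and the contents of each entry are not given in the text.

```python
from collections import deque


class SignatureFifo:
    """Bounded FIFO between run-time signature calculation and checking."""

    def __init__(self, depth):
        self.depth = depth      # hypothetical capacity, in block signatures
        self.entries = deque()

    def push(self, runtime_signature):
        """Pipeline side, called when a block ends.

        Returns False when the FIFO is full, meaning the pipeline must stall.
        """
        if len(self.entries) >= self.depth:
            return False
        self.entries.append(runtime_signature)
        return True

    def check_next(self, reference_signature):
        """Watchdog side: compare the oldest run-time signature with its
        reference. Returns None when there is nothing to check yet."""
        if not self.entries:
            return None
        return self.entries.popleft() == reference_signature
```

In this model the pipeline keeps retiring blocks while the watchdog waits for a reference signature, and it only stalls when `push` reports that the FIFO is full.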

5.1. HORUS compiler support

The GNU gcc compiler already provides a port targeting MIPS processors. As its source code is freely available, it was the natural starting point to provide the required software support for the HORUS processor. The gas program (GNU Assembler) is responsible for the assembly stage in the compilation process, after the program optimization passes and before the final linker stage.

Gas has been modified to support the HORUS MIPS modified architecture and its optional use of the ISIS technique via command line switches.

As instructions are assembled, two rules determine the block boundaries:

1. If the current instruction is the target of a branch instruction, a new block must start and so its signature must be inserted.

2. If the current instruction is a branch, the next instruction will fill the branch delay slot and end the current block.

With this information and the opcode bits of the program instructions, the assembler can calculate block signatures and insert them at appropriate places.
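The two rules above can be sketched as a small block-splitting routine. This is not the modified gas code: the `Instr` record, its flags and the function names are hypothetical, and the corner case of a branch target located in a delay slot is simply not split.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    text: str
    is_branch: bool = False  # branch, jump or call (followed by a delay slot)
    is_target: bool = False  # carries a label referenced by some branch


def split_blocks(instrs):
    """Split an instruction sequence into basic blocks."""
    blocks, current, delay_slot_pending = [], [], False
    for ins in instrs:
        # Rule 1: a branch target starts a new block.
        if ins.is_target and current and not delay_slot_pending:
            blocks.append(current)
            current = []
        current.append(ins)
        # Rule 2: the instruction after a branch fills the delay slot
        # and ends the current block.
        if delay_slot_pending:
            blocks.append(current)
            current, delay_slot_pending = [], False
        elif ins.is_branch:
            delay_slot_pending = True
    if current:
        blocks.append(current)
    return blocks


def is_fall_through(block):
    """True when the block does not end with a branch plus its delay slot."""
    return len(block) < 2 or not block[-2].is_branch
```

Each block returned this way would receive one signature word in front of it, and a block for which `is_fall_through` holds is one where the extra jump instruction would be needed.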

No provisions are needed to modify the target of a branch or call instruction, as all instruction addresses are referred to using symbolic names (labels).

6. System comparison

In this section the HORUS processor is compared against the most closely related works in the literature, those of Ohlsson et al. presented in [6] and Gaisler presented in [9].

Memory requirements for block signatures are fundamentally the same as those of Gaisler's processor [5]. His processor uses a test instruction to fill every branch delay slot. That is, a test instruction is used for every block, exactly as the signature instruction is used for every block in the HORUS processor.

The system architecture of Ohlsson et al.'s work requires additional instructions at procedure entry and exit points, so its memory requirements are larger.

The main differences are in the processor cycles used by the error detection mechanisms. While the HORUS processor demands no CPU cycles to process block signatures, Gaisler's wastes one cycle per block and Galla's requires additional cycles to process procedures. These differences reflect the fact that signatures are not instructions for the HORUS processor.

The HORUS performance is, however, affected by the watchdog processor.

First, it shares the instruction memory with the main processor through the cache controller. As both processors reference the same memory locations at approximately the same time (with a few clock cycles of difference), the instruction cache contains the block signature referenced by the watchdog processor most of the time.

Second, to solve the problems that arise with fall-through blocks, the compiler inserts a jump instruction in the processor instruction stream. Some figures about the memory consumed by signatures in the HORUS processor are presented in [10]. Results show the memory overhead (15% to 28%) is comparable to that of Ohlsson's TRIP processor (13% to 25%).

6.1. Logic synthesis

Preliminary synthesis results show a silicon overhead of about 8.5%. This has been obtained by comparing the synthesis results of two different versions: the initial design (without the watchdog processor) and the whole system previously described in this paper.

The device used for this synthesis is the Virtex2 xc2v6000-4 from Xilinx. This device has been selected because it offers a large pool of logic and routing resources. This way, the overall equivalent gate count provided by the synthesis tools is dominated by the logic used, not by the routing resources. This equivalent gate count provides an overall complexity mark. The value must be taken with caution, however, as it is a global estimation made by the software tool from the different logic resources used: logic cells, distributed memory cells, routing elements, and dedicated memory blocks.

Some technical details must be explained to understand this result. Firstly, the instruction cache and the MMU provide a single port that is multiplexed to provide data to the main and watchdog processors. This multiplexing circuitry creates two "virtual" access ports, one for the main processor and one for the watchdog processor. It works at twice the frequency of the main processor and so does not imply any additional delay cycles for the processor. This way, the silicon overhead for the inclusion of the watchdog processor is reduced, as the cache and MMU contribute only the multiplexing circuitry. The difference between the initial design and the design augmented by the watchdog processor can be roughly considered to be these multiplexing circuits and the watchdog processor itself.

Secondly, large logic blocks originally de-signed with a goal of technology indepen-dence have been redesigned to take advan-tage of speci�c macro blocks o�ered bythe tool vendor (Xilinx). These blocksinclude the content addressable memories(CAMs) used by the instruction cache andthe MMU, and the FIFO memories insidethe watchdog processor used to connect thesignature run-time calculation and refer-ence checking processes.

7. Conclusions

The architecture of the HORUS processor has been presented. It is a pipelined RISC processor augmented with a concurrent error detection mechanism (a watchdog processor) specifically designed to minimize the CPU performance penalty. This goal is of primary importance to successfully employ fault tolerance mechanisms in the current market of SoC systems, where the clock frequency when targeting field-programmable devices is far from the general-purpose personal computer range.

The performance overhead reduction is achieved by modifying the standard meaning of branch and call instructions. This requires that the processor architecture can be modified; increasingly larger devices and modern design methodologies and tools now offer this possibility to the designer.

Compiler support for this architecture has also been outlined and shown to be of reasonable complexity.

Finally, a brief comparison with similar proposals has been carried out. While memory requirements are fundamentally the same as in previous proposals, performance benefits from the fact that signatures are neither fetched nor processed by the CPU.

The HORUS processor has been designed using a synthesizable subset of the VHDL language and is currently under beta test. Preliminary results show that the logic needed to include the watchdog processor increases the system size by a moderate 8.5 %. Characterization of the watchdog processor is currently underway.

Acknowledgements


Bibliography

[1] Avresky, D., Grosspietsch, K.E., Johnson, D.W., Lombardi, F.: Embedded Fault Tolerant Systems. IEEE Micro Magazine, Vol. 18, No. 5, 8-11, 1998.

[2] IEEE Std. 1076-1993: VHDL Language Reference Manual. The Institute of Electrical and Electronics Engineers Inc., New York, 1995.

[3] Gunneflo, U., Karlsson, J., Torin, J.: Evaluation of Error Detection Schemes Using Fault Injection by Heavy-ion Radiation. Proc. of the 19th Fault Tolerant Computing Symposium (FTCS-19), 340-347, Chicago, Illinois, 1989.

[4] Czeck, E.W., Siewiorek, D.P.: Effects of Transient Gate-Level Faults on Program Behavior. Proc. of the 20th Fault Tolerant Computing Symposium (FTCS-20), 236-243, Newcastle upon Tyne, U.K., 1990.

[5] Gaisler, J.: Evaluation of a 32-bit Microprocessor with Built-in Concurrent Error Detection. Proc. of the 27th Fault Tolerant Computing Symposium (FTCS-27), 42-46, Seattle, Washington, 1997.

[6] Ohlsson, J., Rimén, M., Gunneflo, U.: A Study of the Effects of Transient Fault Injection into a 32-bit RISC with Built-in Watchdog. Proc. of the 22nd Fault Tolerant Computing Symposium (FTCS-22), 316-325, Boston, Massachusetts, 1992.

[7] Siewiorek, D.P.: Niche Successes to Ubiquitous Invisibility: Fault-Tolerant Computing Past, Present, and Future. Proc. of the 25th Fault Tolerant Computing Symposium (FTCS-25), 26-33, Pasadena, California, 1995.

[8] Galla, T.M., Sprachmann, M., Steininger, A., Temple, C.: Control Flow Monitoring for a Time-Triggered Communication Controller. Proc. of the 10th European Workshop on Dependable Computing (EWDC-10), 43-48, Vienna, Austria, 1999.

[9] Gaisler, J.: Concurrent Error-Detection and Modular Fault-Tolerance in a 32-bit Processing Core for Embedded Space Flight Applications. Proc. of the 24th Fault Tolerant Computing Symposium (FTCS-24), 128-130, Austin, Texas, 1994.

[10] Rodríguez, F., Campelo, J.C., Serrano, J.J.: A Memory Overhead Evaluation of the Interleaved Signature Instruction Stream. To be presented at the IEEE Intl. Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'2002), Vancouver, Canada, Nov. 2002.

[11] MIPS32 Architecture for Programmers, Volume I: Introduction to the MIPS32 Architecture. MIPS Technologies, 2001.

[12] AMBA Specification rev2.0. ARM Limited, 1999.


Delivering Error Detection Capabilities into a Field Programmable Device: The HORUS Processor Case Study

F. Rodríguez, J. C. Campelo, J. J. Serrano

Dept. Informática de Sistemas y Computadores, Universidad Politécnica de Valencia

e-mail: {prodrig, jcampelo, jserrano}@disca.upv.es

Abstract

Designing a complete SoC or reusing SoC components to create a complete system is a common task nowadays. The flexibility offered by current design flows gives the designer an unprecedented capability to incorporate increasingly demanded features, such as error detection and correction mechanisms, to increase system dependability. This is especially true for programmable devices, where rapid design and implementation methodologies are coupled with testing environments that are easily generated and used. This paper describes the design of the HORUS processor, a RISC processor augmented with a concurrent error detection mechanism, and the architectural modifications needed on the original design to minimize the resulting performance penalty.

1. Introduction

With the advent of modern technologies in the field of programmable devices and the enormous advances in the software tools used to model, simulate and translate into real hardware almost any complex digital system, the capability to design a whole System-on-Chip (SoC) has become a reality even for small companies. With the widespread use of embedded systems in our everyday life, service availability and dependability concerns for these systems are increasingly important [1].

Efficient error detection is of fundamental importance in dependable computing systems. As the vast majority of faults are transient, the use of a concurrent Error Detection Mechanism (EDM) is of utmost interest, as high coverage and low detection latency are needed to recover the system from the error. And as experiments demonstrate [2, 3], a high percentage of non-overwritten errors result in control flow errors.

The possibility of modifying the original architecture of a processor modelled in a language like VHDL gives the SoC designer an unprecedented capability to incorporate EDMs that were previously available only to large design companies.

Siewiorek states in [4] that "To succeed in the commodity market, fault-tolerant techniques need to be sought which will be transparent to end users". A fault-tolerant technique can be considered transparent only if it results in minimal overhead in silicon, memory size or processor performance. Although redundant systems can achieve the best degree of fault tolerance, the high overheads they imply limit their applicability in field-programmable devices, where silicon is the scarcest resource. The same limitation applies when a software-only solution is used, due to the processor performance penalty incurred.

Siewiorek's statement can also be interpreted as a demand to implement fault-tolerant techniques that minimise their impact on the main processor clock, if we want those techniques to be used at all.

The work presented here describes the overheads incurred in the HORUS processor [5] to incorporate a concurrent EDM. HORUS is a classic pipelined RISC processor designed in VHDL and synthesized into a Xilinx FPGA that has been augmented with a concurrent EDM for control flow errors with minimal performance and silicon penalty and moderate memory overhead.

Figure 1: Original system architecture.

The paper is structured as follows: the next section presents the overall system architecture, including the watchdog processor and its connection with the rest of the system. Section 3 presents the overhead results of our approach, taking into account the memory footprint, performance loss and silicon complexity. The paper finishes with the conclusions.

2. Introduction of the HORUS processor architecture

The original processor architecture (see Figure 1) is built around a soft core of a MIPS R3000 processor clone developed in synthesizable VHDL. It is a four-stage pipelined RISC processor running the MIPS-I and MIPS-II Instruction Set Architecture, except that it does not provide floating point support. Instruction and data bus interfaces are designed as ARM AMBA AHB bus masters providing memory access for the main and watchdog processors.

This processor is provided with a Memory Management Unit (MMU) to perform virtual to physical address mapping, isolate the memory areas of different processes and check the correct alignment of memory references.

2.1. Error detection mechanism

The main processor has been augmented to include a watchdog processor implementing the ISIS (Interleaved Signature Instruction Stream) technique [5]. This watchdog processor is capable of detecting control flow errors and instruction errors in the main processor.

The instructions for this watchdog processor (called signatures) are interleaved with the instructions of the main processor. However, the main processor automatically skips them and, conversely, the watchdog processor fetches signatures only and jumps over the main processor instructions.

No modification is needed in the processor instruction set because watchdog instructions are neither fetched nor executed by the main processor. This allows us to maintain binary compatibility with existing software. If access to the source code is not possible, the program can be run without modification (and no concurrent error detection capability will be provided).

The modifications to the original processor architecture are described in detail in [5]. These can be enabled and disabled under software control. If these features are disabled, our processor behaves as an off-the-shelf MIPS processor. Thus, if binary compatibility is needed for a given task, the OS must disable these features every time the task resumes execution.

Figure 2: Augmenting the system with the watchdog processor.

To minimize the resulting performance penalty, the instruction cache is designed with two read ports that can provide two instructions simultaneously, one for the main processor and one for the watchdog. If one processor gets a cache hit, no interference exists even if the other processor is stalled because of a cache miss.

This arrangement takes advantage of spatial locality in the application program to increase cache hits for signatures. As signatures are interleaved with processor instructions, when both processors produce a cache miss they request the same memory block most of the time, as both reference words in the same program area within very few cycles of each other.

To detect errors, the watchdog processor is fed with the instructions from the main processor as they are retired from its pipeline (see Figure 2). The main processor instructions are treated as data by the watchdog, and are processed at their arrival rate. Every time the main processor executes a branch instruction, the watchdog processor checks the main processor execution flow by comparing a run-time calculated value with its reference (the signature). If a mismatch is found, an exception is raised to signal the detected error.

As the watchdog processor needs some memory data (the signature) to perform the check, it cannot be guaranteed that the signature is ready by the time the processor executes the branch. To decouple the checking process, a FIFO memory is used to store the run-time calculated values while the signature is fetched. This FIFO allows a large set of instructions to be retired from the main processor pipeline while the watchdog is waiting for the signature (due to a cache miss). In a similar way, the watchdog can empty the FIFO while the main processor pipeline is stalled due to a memory operation.


The next subsections present the overhead related to the inclusion of the watchdog processor in the system design. This analysis covers memory, performance, clock frequency and silicon.

3.1. Memory overhead

Table 1 below shows some measurements obtained using a modified version of the GNU compiler. The data include the number of instructions of the original program, the number of instructions inserted to allow the main processor to jump over signatures, and the number of signatures.

These programs were selected as representative of the types of programs (iterative, recursive, mixed) expected to run on these systems.

fft. The Fast Fourier Transform applied to a random set of values. It is a good representative of a sequential program, as it has a single block with 168 (!) sequential instructions.

hanoi. The classic Towers of Hanoi programming problem.

quicksort. This program sorts a randomly initialized array of numbers. It is a mix of sequential statements and recursion, as this specific version makes extensive use of unrolled loops and function inlining.

matrix. This program performs an integer matrix multiplication.

queens. It solves the classic nine queens placement problem.

Although the memory overhead may seem large at first sight, the comparison of these data in [6] shows that they are similar to the overhead obtained with previously published watchdogs.


Table 1: Memory overhead

Program      Original instr.   Inserted instr.   Signatures added   Overhead
Fft                      281                11                 33    15.66 %
Hanoi                    118                 0                 23    19.50 %
Quicksort                844                36                203    28.32 %
Matrix                   139                 5                 27    23.02 %
Queens                   305                11                 59    22.95 %
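The overhead column of Table 1 appears to be the number of words added by the compiler (inserted instructions plus signatures) relative to the original instruction count. The following minimal sketch recomputes it under that assumption, and under the further assumption that a signature occupies the same space as one instruction; the function and variable names are illustrative only.

```python
# Memory overhead as suggested by Table 1: words added by the compiler
# (inserted jump instructions plus signatures) over the original size.
# Assumption: signatures and instructions have the same word size.

def memory_overhead(original, inserted, signatures):
    """Return the memory overhead as a percentage of the original size."""
    return 100.0 * (inserted + signatures) / original


# (original instr., inserted instr., signatures added) taken from Table 1
TABLE_1 = {
    "fft": (281, 11, 33),
    "hanoi": (118, 0, 23),
    "quicksort": (844, 36, 203),
    "matrix": (139, 5, 27),
    "queens": (305, 11, 59),
}

if __name__ == "__main__":
    for program, counts in TABLE_1.items():
        print(f"{program:10s} {memory_overhead(*counts):6.2f} %")
```

Under these assumptions the computed values agree with the published column to within rounding.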

3.2. Performance overhead

To obtain the performance overhead, we have simulated the VHDL model of the FPGA device and directed it to execute the test programs presented in the previous section. Two versions of each program have been simulated: the original one (with no signatures, simulated with the watchdog processor disabled) and the version augmented with signatures for the watchdog processor (simulated with the watchdog processor enabled).

Table 2 shows the number of CPU cycles needed to complete the execution of the test programs. These data show that the main processor sees only a small performance degradation (about 6 % at most) when the watchdog processor is in use.

With a memory overhead above 19 %, the performance result of the hanoi test program seems surprisingly good. However, this result can be explained if we take into account that the main processor is always given priority when simultaneous cache misses occur. A detailed analysis of the simulation trace shows that the main processor was always executing a set of instructions while the watchdog processor was waiting for the signature word needed to check precisely those instructions. That is, the main processor was always several instructions ahead of the checking process of the watchdog processor. When the cache fetches the signature requested by the watchdog processor, the main processor is generating cache hits and executing one instruction per cycle.

It is interesting to note that the quicksort program, with the largest memory penalty, shows a small performance overhead (under 2 %). This is because the performance penalty is related to how well the test program uses the instruction cache, not to the static memory footprint of the program itself.

Another interesting conclusion comes from the analysis of the results for the matrix program. Although it is a small program consisting mainly of nested loops (that is, it makes good use of the instruction cache), it shows the worst performance. From the program source code and the simulation trace, it is evident that this program is dominated by the data memory access time. As the program performs many reads and writes, and there is no data cache, the probability of interference between a refill of the instruction cache due to a watchdog miss and a memory operation increases. If the cache refill has already started, the memory operation is delayed (as instruction and data paths share the AMBA bus), thus increasing the time the CPU needs to finish the program.

Table 2: Performance penalty

             CPU cycles          CPU cycles
Program      without watchdog    with watchdog    Overhead
Fft                       892              940      5.38 %
Hanoi                    2717             2732      0.55 %
Quicksort                 510              520      1.96 %
Matrix                    330              350      6.06 %
Queens                    925              967      4.54 %
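The overhead column of Table 2 is consistent with the relative increase in CPU cycles when the watchdog is enabled. The short sketch below recomputes it from the cycle counts; the function name is hypothetical and the formula is inferred from the table rather than stated in the text.

```python
# Performance overhead inferred from Table 2: extra CPU cycles with the
# watchdog enabled, relative to the run without the watchdog.

def performance_overhead(cycles_without, cycles_with):
    """Return the cycle-count increase as a percentage."""
    return 100.0 * (cycles_with - cycles_without) / cycles_without


# (cycles without watchdog, cycles with watchdog) taken from Table 2
TABLE_2 = {
    "fft": (892, 940),
    "hanoi": (2717, 2732),
    "quicksort": (510, 520),
    "matrix": (330, 350),
    "queens": (925, 967),
}

if __name__ == "__main__":
    for program, cycles in TABLE_2.items():
        print(f"{program:10s} {performance_overhead(*cycles):5.2f} %")
```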

To our knowledge, no performance overhead analysis is available for the watchdog processors presented in the literature. However, taking the proposal in [3] as an example, the memory footprint increase lies between 12 % and 25 %. If the program being executed has few data movement instructions (like the hanoi test program), and given that signatures must be processed as normal instructions in that proposal, one could expect a performance overhead of 10 % in the best case.

3.3. Clock frequency and silicon overhead

In the case of field-programmable devices, it is well known that different sub-elements can achieve very different operating frequencies. In our case, the main processor is the slowest element and the rest of the system (including the watchdog, instruction cache and MMU) can use a clock with at least double the frequency. That is, the watchdog does not limit the processor or cache clock.

We have taken advantage of this fact to reduce the complexity of the instruction cache, running the watchdog at the main processor frequency to simplify the design. Instead of creating a true dual-port cache, the cache has a single read port and some glue logic that gives each processor read access at one half of the cache operating frequency. As the operating frequency of the processors is one half of the cache frequency, both processors have read access at their maximum frequency and can be fed with data every cycle (on cache hits).

The same time-multiplexing technique has been used in the MMU to deliver address translations for both processors in the same CPU cycle with a single-access-port MMU.
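The time multiplexing described above can be pictured with a small behavioural model: one physical read port is used twice per processor cycle, once for each requester. This is only a sketch under stated assumptions; the class and function names are hypothetical, and the order of the two slots (main processor first) is assumed, as the text does not specify it.

```python
class SinglePortCache:
    """Toy single-read-port memory that counts port accesses."""

    def __init__(self, words):
        self.words = dict(words)
        self.port_accesses = 0

    def read(self, address):
        # One access corresponds to one fast (cache) clock cycle.
        self.port_accesses += 1
        return self.words[address]


def processor_cycle(cache, main_address, watchdog_address):
    """One processor cycle, modelled as two cache cycles at double frequency.

    Assumption: the first cache cycle serves the main processor and the
    second serves the watchdog; the real slot order is not stated.
    """
    main_word = cache.read(main_address)          # cache cycle 0
    watchdog_word = cache.read(watchdog_address)  # cache cycle 1
    return main_word, watchdog_word
```

From each processor's point of view the cache behaves as a private port delivering one word per processor cycle, which is the "virtual port" effect the design relies on.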

The whole system has been targeted to a Xilinx Virtex2 device (an xc2v6000-4). To obtain the silicon overhead derived from the use of the watchdog processor, two versions have been created, synthesized and compared. The first one has a true single-port instruction cache and no watchdog, and the second one incorporates all the elements previously described.

As the device technology incorporates many different programmable elements (flip-flops, combinational logic functions, memory blocks, tristate buffers), there is no single value that can measure the silicon cost of both versions to perform a fair comparison. However, the Xilinx tools provide a total "equivalent" number of gates, a rough measure of system complexity that can be of some use. Comparing those numbers, the overall silicon overhead due to the inclusion of the error detection mechanism is 8.49 %.

4. Conclusions

The architecture of the HORUS processor has been presented. It is a pipelined RISC processor augmented with a concurrent error detection mechanism (a watchdog processor) specifically designed to minimize the CPU performance penalty. This goal is of primary importance to successfully employ fault tolerance mechanisms in the current market of field-programmable systems, where the clock frequency is far from the general-purpose personal computer range.

The performance overhead reduction is achieved by modifying the standard meaning of branch and call instructions. This is only possible when the processor is modelled in a language like VHDL and architecture modifications are therefore possible. The results obtained from several test programs show that the performance penalty is about 6 % or below, and that it is negligible in some cases, depending on the program structure and its use of the instruction cache.

While memory requirements are fundamentally the same as in previous proposals, performance benefits from the fact that signatures are neither fetched nor processed by the CPU. This fact and the results of the silicon studies (a silicon increase below 8.5 % and an unaffected CPU clock) favour the use of such a solution when a concurrent error detection mechanism is needed in a field-programmable device.

The HORUS processor has been designed using a synthesizable subset of the VHDL language and is currently under beta test. Characterization of the watchdog processor (percentage of errors detected, mean number of cycles to detect an error) is currently underway.

Bibliography





[5] Rodríguez, F., Campelo, J.C., Serrano, J.J.: A Watchdog Processor Architecture with Minimal Performance Overhead. Proc. of the 21st Safety and Reliability Conference (SAFECOMP'02), Catania (Italy), Sept. 2002.

[6] Rodríguez, F., Campelo, J.C., Serrano, J.J.: A Memory Overhead Evaluation of the Interleaved Signature Instruction Stream. IEEE Intl. Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'2002), Vancouver, Canada, Nov. 2002.


A Memory Overhead Evaluation of the Interleaved Signature Instruction Stream

F. Rodríguez, J. C. Campelo, and J. J. Serrano

Grupo de Sistemas Tolerantes a Fallos - Fault Tolerant Systems Group
Dept. Informática de Sistemas y Computadores
Universidad Politécnica de Valencia, 46022 Valencia (Spain)
e-mail: {prodrig, jcampelo, jserrano}@disca.upv.es
http://www.disca.upv.es/gstf

Abstract

Using a watchdog processor for concurrent error detection of a processor execution flow is a well-known technique to increase the dependability of a microprocessor system. Most approaches embed reference signatures for the watchdog processor into the processor instruction stream, creating noticeable memory and performance overheads.

The Interleaved Signature Instruction Stream (ISIS) is a signature embedding technique that allows signatures to co-exist with the main instruction stream with minimal impact on processor performance, without sacrificing error detection coverage or latency.

This technique has been implemented in HORUS, a MIPS-like RISC processor developed in VHDL. This paper presents the novelties in the HORUS architecture demanded by ISIS, discusses the performance impact of adding an ISIS watchdog processor and provides results on the ISIS memory overhead. These results are compared against similar solutions previously presented in the literature.

1. Introduction

In the "Model for the Future" foreseen by Avizienis in [1], the urgent need to incorporate dependability into everyday computing is clear: "Yet, it is alarming to observe that the explosive growth of complexity, speed, and performance of single-chip processors has not been paralleled by the inclusion of more on-chip error detection and recovery features".

Efficient error detection is of fundamental importance in dependable computing systems. As the vast majority of faults are transient, the use of a concurrent Error Detection Mechanism (EDM) is of utmost interest, as high coverage and low detection latency are needed to recover the system from the error. And as experiments demonstrate [2, 3, 4, 5], a high percentage of non-overwritten errors result in control flow errors.

An application program is divided into branch-free intervals [5], called instruction blocks. A derived signature is a value assigned to each instruction block. The term derived means that the signature is not an arbitrarily assigned value but is calculated from the block's instructions. Derived signatures are usually obtained by XORing the instruction opcodes or by feeding those opcodes into a Linear Feedback Shift Register (LFSR). These values are calculated at compile time and used as references by the EDM to verify the correctness of the executed instructions.
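The two derivation methods just mentioned can be sketched as follows. This is an illustration only: the 32-bit word width matches MIPS instructions, but the feedback polynomial (the common CRC-32 polynomial 0x04C11DB7), the zero initial state and the function names are assumptions, since the text does not say which LFSR is used.

```python
# Two common ways to derive a block signature from 32-bit instruction words.
# The polynomial below is an illustrative assumption, not taken from the text.

MASK32 = 0xFFFFFFFF
POLY = 0x04C11DB7  # assumed feedback polynomial (CRC-32)


def xor_signature(words):
    """XOR all instruction words of a branch-free block."""
    signature = 0
    for word in words:
        signature ^= word & MASK32
    return signature


def lfsr_signature(words, poly=POLY):
    """Shift every word, most significant bit first, into a 32-bit LFSR."""
    state = 0
    for word in words:
        for bit in range(31, -1, -1):
            in_bit = (word >> bit) & 1
            feedback = ((state >> 31) & 1) ^ in_bit
            state = (state << 1) & MASK32
            if feedback:
                state ^= poly
    return state


if __name__ == "__main__":
    block = [0x8FA40010, 0x00851021, 0xAFA20014]  # arbitrary example words
    print(hex(xor_signature(block)), hex(lfsr_signature(block)))
```

Both functions change their result when a single bit of any word is flipped. The XOR variant is insensitive to the order of the words within a block, which is one reason an LFSR is often preferred.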

If signatures are interspersed or hashed with the processor instructions, the method is generally known as Embedded Signature Monitoring (ESM). A watchdog processor [6] is a hardware EDM used to detect Control Flow Errors (CFE) and/or corruption of the instructions executed by the processor, usually employing derived signatures and an ESM technique. In this case it calculates signatures from the instruction opcodes that are actually executed by the main processor, and checks these run-time values against their references. If any difference is found, an error in the main processor instruction stream is detected and an Error Recovery Mechanism (ERM) is activated.

The Interleaved Signature Instruction Stream (ISIS) technique is an ESM technique that intersperses signatures with the main processor instructions and allows the inclusion of a watchdog processor in a complex microprocessor system. It has been implemented in a complex MIPS-like RISC processor designed in VHDL called HORUS. To minimize the impact that using signatures for concurrent error detection has on main processor performance, the processor does not fetch or execute signatures.

The paper is structured as follows: the next section outlines ESM techniques previously proposed in the literature. Section 3 introduces the HORUS architecture and how the watchdog processor is included in the system, and Section 4 presents the ISIS technique and the processor modifications. The following sections discuss how signatures impact main processor performance and compare the memory overhead with that of similar work. The paper finishes with the conclusions.

2. Related work

Several hardware approaches using a watchdog processor and derived signatures for concurrent error detection have been previously proposed in the literature. The most relevant works are outlined below.

Ohlsson et al. present in [5] a watchdog processor built into a RISC processor. A specialized tst instruction is inserted in the delay slot of every branch instruction, testing the signature of the preceding block. An instruction counter is also used to time out an instruction sequence when a branch instruction is not executed within the specified range (signaling a branch deletion error). Other watchdog support instructions are added to the main processor instruction set to save and restore the value of the instruction counter on procedure calls.

The watchdog processor used by Galla et al. in [7] to verify the correct execution of a communications controller of the Time-Triggered Architecture uses a similar approach. A check instruction is inserted in appropriate places to trigger the checking process with the reference signature, which is stored in the subsequent word. In the case of a branch, the branch delay slot is used to insert an adjustment value for the signature to ensure that the run-time signature is the same at the check instruction regardless of the path followed. An instruction counter is also used by the watchdog. The counter is loaded during the check instruction and decremented for every instruction executed; a time-out is issued if the counter reaches zero before a new check instruction is executed. Due to the nature of the communications architecture, no interrupts are processed by the controller. Thus, saving the run-time signature or the instruction counter is not necessary.

The ERC32 is a SPARC processor augmented with parity bits and a program flow control mechanism, presented by Gaisler in [8]. In the ERC32, a test instruction to verify the processor control flow is also inserted in the delay slot of every branch to verify the instruction bits of the preceding block. In his work, the test instruction is a slightly modified version of the original nop instruction and no other modifications to the instruction set are needed.

A different error detection mechanism is presented by Kim and Somani in [9]. The decoded signals for the pipeline control are checked on a per-instruction basis and their references are retrieved from a watchdog private cache. If the run-time signature of a given instruction cannot be checked because its reference counterpart is not found in the cache, it is stored in the cache and used as the reference for future executions. No signatures or program modifications are needed because reference signatures are generated at run-time, thus creating no overhead. The drawback of this approach is that the watchdog processor cannot check all instructions. An instruction can be checked only if it has been previously executed and its reference has not been displaced from the watchdog private cache to store other signatures. Although the error is detected before the instruction is committed and no overheads are created, the error coverage is poor, as there is no guarantee that a given instruction can be checked against a valid reference.
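The coverage limitation of a run-time learned reference can be illustrated with a toy model. This is not the mechanism of [9] itself, only a sketch of the idea as summarized above; the cache capacity, the least-recently-used replacement policy and all names are assumptions made for the example.

```python
from collections import OrderedDict


class LearnedReferenceChecker:
    """Reference signatures learned at run time and kept in a small cache.

    Capacity and LRU replacement are illustrative assumptions.
    """

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.references = OrderedDict()

    def check(self, pc, runtime_signature):
        """Return True or False if the instruction could be checked,
        or None if no reference was available (first use or evicted)."""
        if pc in self.references:
            self.references.move_to_end(pc)
            return self.references[pc] == runtime_signature
        if len(self.references) >= self.capacity:
            self.references.popitem(last=False)  # evict least recently used
        self.references[pc] = runtime_signature
        return None
```

Every `None` result is an instruction executed without verification, which is the source of the poor coverage noted above: the first execution, and any execution after the reference has been displaced, goes unchecked.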

3. HORUS Architecture

The system (see Fig. 1) is built around a soft core of a MIPS R3000 processor clone developed in synthesizable VHDL [10] at the register-transfer level (RTL). We have called it HORUS¹.

HORUS is a 4-stage pipelined RISC processor running the MIPS-I and MIPS-II Instruction Set Architecture [11]. The internal bus of this SoC follows the AMBA [12] multimaster Advanced High-Performance Bus (AHB) specification. Instructions and data are retrieved from external memory using two separate AHB masters to improve bus utilization.

This processor is provided with a Memory Management Unit (MMU) inside the System Control Coprocessor (CP0 in the MIPS nomenclature) to perform virtual to physical address mapping, isolate the memory areas of different processes and check the correct alignment of memory references.

To minimize the performance penalty, the instruction cache is designed with two read ports that can provide two instructions simultaneously, one for each processor. On a cache hit, no interference exists even if the other processor is waiting for a cache line refill because of a cache miss.

A single write port is provided to access the AHB bus, so it must be shared by both processors. When simultaneous cache misses happen, cache refills are served in a first-come, first-served fashion. If the misses happen in the same clock cycle, the main processor is given priority.
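The refill arbitration rule stated above (first come, first served, with the main processor winning ties) can be written as a small selection function. The request representation and the names used here are hypothetical and only serve to illustrate the rule.

```python
def next_refill(pending):
    """Pick the cache refill request to serve next.

    pending: non-empty list of (request_cycle, requester) tuples, where
    requester is "main" or "watchdog". Earlier requests are served first;
    on a tie in the same clock cycle the main processor has priority.
    """
    return min(pending, key=lambda req: (req[0], 0 if req[1] == "main" else 1))
```

For example, `next_refill([(12, "watchdog"), (12, "main")])` selects the main processor request, while `next_refill([(10, "watchdog"), (12, "main")])` serves the watchdog first because its miss occurred earlier.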

This arrangement takes advantage of spatial locality in the application program to increase cache hits for signatures. As we use an ESM technique and signatures are interleaved with processor instructions, when both processors produce a cache miss they request the same memory block most of the time, as both reference words in the same program area.

¹ HORUS is the name of an ancient Egyptian god, son of ISIS and OSIRIS.

No modification is needed in the processor instruction set because signature instructions are neither fetched nor executed by the main processor. This allows us to maintain binary compatibility with existing software. If access to the source code is not possible, the program can be run without modification (and no concurrent flow error detection capability will be provided). This is possible because the watchdog processor and the modified processor architecture can be enabled and disabled by software running with superuser privileges. If these features are disabled, our processor behaves as an off-the-shelf MIPS processor. Thus, if binary compatibility is needed for a given task, these features must be disabled by the OS every time the task resumes execution.

Figure 1: System architecture

The watchdog processor is fed with the instructions from the main processor pipeline as they are retired. As these instructions enter the watchdog, the run-time signatures and address parity bits are calculated at the arrival rate of the instructions. When a block ends, these values are stored in a FIFO memory to decouple the signature checking process. This FIFO allows a large set of instructions to be retired from the pipeline while the watchdog is waiting for a cache refill in order to get a reference signature instruction. In a similar way, the FIFO can be emptied by the watchdog while the main processor pipeline is stalled due to a memory operation. When this memory is full, the pipeline is forced to wait until the watchdog checking process has read some data from the FIFO.
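The decoupling role of the FIFO can be modelled in a few lines. The sketch below is behavioural only and rests on assumptions: the FIFO depth is arbitrary, a single integer stands in for the stored run-time values (signature and address parity), and the class and method names are hypothetical.

```python
from collections import deque


class SignatureFifo:
    """Bounded FIFO between run-time signature calculation and checking.

    The depth is an illustrative assumption; the text does not give it.
    """

    def __init__(self, depth=4):
        self.depth = depth
        self.entries = deque()

    def push(self, runtime_signature):
        """Store the value computed for a finished block.

        Returns False when the FIFO is full, meaning the main processor
        pipeline would have to stall until the watchdog reads an entry.
        """
        if len(self.entries) >= self.depth:
            return False
        self.entries.append(runtime_signature)
        return True

    def check(self, reference_signature):
        """Compare the oldest run-time value with a fetched reference.

        Returns None if there is nothing to check yet, True on a match
        and False on a mismatch (an error would be signalled).
        """
        if not self.entries:
            return None
        return self.entries.popleft() == reference_signature
```

While the watchdog waits for a reference, `push` keeps succeeding until the depth is reached, so the main processor only stalls when the watchdog falls behind by more than the FIFO capacity.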

4. Interleaved Signature Instruction Stream

Contrary to other ESM techniques wheresignatures are placed in the delay slot ofthe branch instruction �nishing the block,signatures are placed at the beginning ofevery basic block in the ISIS scheme [13].These references incorporate, among otherchecking mechanisms, block signature andblock length.

Besides the error detection capabilities obtained from the block signature, branch insertion and branch deletion errors can also be detected, because the block reference word includes the block length and can be retrieved by the watchdog processor as soon as the block begins.

The signature instruction encoding has been designed in such a way that a main processor instruction cannot be misinterpreted as a watchdog signature instruction. This provides an additional check when a branch instruction is executed by the main processor. This check consists of the requirement to find a signature instruction immediately preceding the first instruction of every block. This also helps to detect a CFE if a branch erroneously reaches a signature instruction, because the encoding used will force an illegal instruction exception to be raised.

Additional checks related to the signature instruction type and partial jump address verification are also included.
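How a known block length exposes branch insertion and deletion can be sketched as follows. This is a simplified model, not the watchdog's actual logic: the function name and the convention that a flag marks the instruction closing the block (the branch together with its delay slot) are assumptions made for illustration.

```python
def check_block_length(expected_length, ends_block):
    """Classify one block from flags attached to its retired instructions.

    ends_block[i] is True when the i-th retired instruction closes the
    block. expected_length comes from the reference word read at block start.
    """
    for position, is_end in enumerate(ends_block, start=1):
        if is_end and position < expected_length:
            return "branch insertion"      # the block was closed too early
        if position == expected_length:
            # The block must close exactly here, otherwise a branch was lost.
            return "ok" if is_end else "branch deletion"
    return "incomplete"                    # fewer instructions retired so far


print(check_block_length(3, [False, False, True]))   # correct block
print(check_block_length(3, [False, True]))          # spurious branch
print(check_block_length(3, [False, False, False]))  # missing branch
```

Because the expected length is available as soon as the block starts, the check can fire at the faulty instruction instead of waiting for the signature comparison at the end of the block.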

4.1. Processor Architecture Modifications

Signatures are used by the watchdog processor only and are not processed in any way by the main processor. To achieve this isolation from the main processor, they must be skipped, either automatically at run-time or by means of jump instructions.

Figure 2: An if-then-else example (a). After block signatures and jump insertion (b)

Isolating the reference signatures from the instructions fed into the processor pipeline results in a minimal performance overhead in the application program. Slight architecture modifications are needed in the main processor in order to achieve it.

First of all, when a conditional branch instruction ends a basic block, a second block follows immediately. The second block's signature sits between them, and the main processor must skip it. In order to jump over the signature effectively, the signature size is added to the Program Counter if the branch is not taken.

In the same way, when a procedure call instruction ends a basic block, the next one to be executed after the procedure returns immediately follows. Again, the second block's signature must be taken into account when calculating the procedure return address. And again, this is achieved by an automatic addition of the signature size to the PC.

The additions to the PC mentioned above can be generated automatically at run-time because the control unit decodes a branch or procedure call instruction at the end of the block. The instruction is a clear indication that the block end will arrive soon. As the processor has a pipelined architecture, the next instruction is executed in all cases (this is known as the branch delay slot), so the control unit has a clock cycle to prepare for the addition. Although the instruction in the delay slot is placed after the branch, it logically belongs to the same block, as it is executed even if the branch is taken.
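The two PC adjustments can be written down as simple address arithmetic. The sketch below assumes a 4-byte instruction word, a signature that occupies exactly one word, and the layout branch, delay slot, signature, next block; the function names are hypothetical.

```python
WORD = 4                 # bytes per MIPS instruction word
SIGNATURE_SIZE = WORD    # assumption: one signature word per block


def fall_through_pc(delay_slot_pc):
    """First instruction of the next block after a not-taken branch.

    The word right after the delay slot holds the next block's signature,
    so the sequential PC is advanced by the signature size as well.
    """
    return delay_slot_pc + WORD + SIGNATURE_SIZE


def return_address(call_pc):
    """Return address stored by a procedure call placed at call_pc.

    Without signatures the return address is call_pc + 8 (the call plus its
    delay slot); the signature size is added on top of that.
    """
    return call_pc + 2 * WORD + SIGNATURE_SIZE


# Branch at 0x100, delay slot at 0x104, signature at 0x108, next block at 0x10C.
print(hex(fall_through_pc(0x104)))
print(hex(return_address(0x100)))
```

Both cases resolve to the same address because the skipped word sits in the same position relative to the delay slot.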

However, in the case of a block fall-through [13] the control unit has no clue to determine when the first block ends, so the signature cannot be automatically jumped over. In this case, the compiler (currently, only the GNU C compiler, gcc) explicitly adds an unconditional jump to skip it. This is the only case where a processor instruction must be added in order to isolate the main processor from the signature stream. Figure 2a shows an example of an if-then-else construct with a fall-through block that needs such an addition (shown in Fig. 2b).

Table 1: Memory overhead analysis

Program     Blocks  Instrs  Delay slots  Jumps  Signatures  Total overhead
fft             20     281           20     11          33  44 (15.66 %)
hanoi           21     118           21      0          23  23 (19.50 %)
quicksort      199     844          165     36         203  239 (28.32 %)
heapsort        52     315           45      9          56  65 (20.63 %)
matrix          25     139           20      5          27  32 (23.02 %)
queens          54     305           46     11          59  70 (22.95 %)

The block length field used by the watchdog processor also imposes restrictions on the length of the blocks. The compiler inserts jumps when the block length exceeds the maximum allowed, currently set to sixteen instructions.
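A possible splitting policy is sketched below. The text does not describe how the compiler chooses the split points, so this greedy scheme, the function name and the assumption that each inserted jump takes one slot of the block it closes (its delay slot being filled with an original instruction) are all hypothetical.

```python
MAX_BLOCK_LENGTH = 16  # maximum block length stated in the text


def split_block(n_instructions, max_length=MAX_BLOCK_LENGTH, jump_slots=1):
    """Return the number of original instructions placed in each piece.

    Every piece except the last reserves jump_slots positions for the
    inserted jump, so it carries max_length - jump_slots original
    instructions.
    """
    pieces = []
    remaining = n_instructions
    while remaining > max_length:
        payload = max_length - jump_slots
        pieces.append(payload)
        remaining -= payload
    pieces.append(remaining)
    return pieces


pieces = split_block(168)           # a long sequential block
print(len(pieces), sum(pieces))     # number of pieces, instructions preserved
print(split_block(17))              # a block just over the limit
```

Each split also creates a new block that needs its own signature, so long sequential blocks cost both an inserted jump and a signature per piece.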


Table 1 shows some measurements obtained with a modified version of the gcc compiler, which has been tailored to produce block information about the compiled programs. The data produced include the number of sequential blocks and instructions of the original source, and how many of the blocks are finished by a branch/call instruction and its associated delay slot. When the compiler is instructed to generate ISIS signatures, it also provides the number of inserted jumps (due to fall-through blocks or blocks that are too large) and signatures.

The jumps column includes the number of jumps inserted by the compiler. When a jump instruction is inserted, a nop instruction may also be inserted to fill its delay slot if no instruction of the original block can be scheduled there. In this case, the nop instruction is also counted.
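The total overhead column of Table 1 appears to be the sum of the jumps and signatures columns, expressed relative to the original instruction count. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def memory_overhead(instructions, jumps, signatures):
    """Return (added words, overhead in percent of the original size)."""
    added = jumps + signatures
    return added, 100.0 * added / instructions


added, percent = memory_overhead(instructions=281, jumps=11, signatures=33)
print(added, round(percent, 2))     # fft row of Table 1

q_added, q_percent = memory_overhead(instructions=844, jumps=36, signatures=203)
print(q_added, round(q_percent, 2)) # quicksort row of Table 1
```

The same formula reproduces the other rows; for hanoi it yields 19.49 % against the printed 19.50 %, presumably a rounding difference.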

The analysed programs have been compiled with no optimization options, and so the results shown are worst-case values. These programs are:

Fft is the Fast Fourier Transform applied to a random set of values. It is a good representation of sequential programs, as it has a single block with 168 (!) instructions that ISIS must split into several blocks to fit its maximum block length.

Hanoi solves the Towers of Hanoi problem; quicksort and heapsort sort a randomly initialized array of numbers.

Matrix performs an integer matrix multiplication and queens solves the classic nine queens placement problem.

ISIS presents a memory overhead similar to that of earlier comparable approaches, as outlined in the next section. However, the additional error detection capabilities provided (block length, jump address and instruction signatures) compensate for this overhead, taking into account that the main processor only has to process the inserted jumps. This means a reduced impact on its execution performance, contrary to previous techniques, which demand CPU cycles to process signatures.

5.1. Comparison with related work

A purely software approach to concurrent error detection was evaluated by Wildner in [14]. This control flow EDM is called Compiler-Assisted Self Checking of Structural Integrity (CASC) and is based on address hashing and signature justifying to protect the return address of every procedure. At the procedure entry, the return address of the procedure is extracted from the link register into a general-purpose register to be operated on. The first operation is the inversion of the LSB of the return address to provoke a misalignment exception in the case of a CFE. An add instruction is inserted at each basic block to justify the procedure signature and, at the exit point, the final justifying and re-inversion of the LSB are calculated and the result is transferred to the link register before returning from the procedure. In the case of a CFE, the return address obtained is likely to cause a misalignment exception, thus catching the error. The experiments carried out on a RISC SPARC processor resulted in a code size overhead for the SPECint92 benchmarks varying from 0 % to 28 %, depending on the run-time library used and the benchmark itself.

The hardware watchdog of Ohlsson et al. presented in [5] uses a tst instruction per basic block, taking advantage of the branch delay slot of a pipelined RISC processor called TRIP. One of the detection mechanisms used by the watchdog is an instruction counter that issues a time-out exception if a branch instruction is not executed during the specified interval. When a procedure is called, two instructions are inserted to save the block instruction counter, and another instruction is inserted at the procedure end to restore it. Their watchdog code size overhead is evaluated to be between 13 % and 25 %.

6. Conclusion

We have outlined the ISIS signature embedding technique and how it has been implemented in HORUS, a soft-core of a pipelined RISC processor.

The modifications demanded by signature embedding to the original processor architecture have been discussed. These modifications are very simple and can be enabled and disabled by software with superuser privileges to maintain binary compatibility with existing software. No specific features of the processor have been used, so porting ISIS to a different processor architecture is quite straightforward.

Memory performance overhead has been analyzed using a modified version of the gcc compiler to extract the program information needed. The resulting overhead (from 15 % up to 28 %) is comparable with other signature embedding methods previously proposed in the literature.

Several advantages distinguish the ISIS technique, however. Error detection capabilities include signature processing, block length count, block type verification, and others.

We are currently obtaining experimental data on performance penalty and error detection coverage and latency. As expected, initial results show a small performance overhead between 0.5 % and 6 %, depending on the test program. These results are achieved because both processors share not only the instruction cache but also temporal and spatial locality. That is, only a small subset of watchdog fetches interferes with normal processor execution.

Acknowledgements


Bibliography

[1] Avizienis, A.: Building Dependable Systems: How to Keep Up with Complexity. Proc. of the 25th Fault Tolerant Computing Symposium (FTCS-25), 4-14, Pasadena, California, 1995.

[2] Gunneflo, U., Karlsson, J., Torin, J.: Evaluation of Error Detection Schemes Using Fault Injection by Heavy-ion Radiation. Proc. of the 19th Fault Tolerant Computing Symposium (FTCS-19), 340-347, Chicago, Illinois, 1989.

[3] Czeck, E.W., Siewiorek, D.P.: Effects of Transient Gate-Level Faults on Program Behavior. Proc. of the 20th Fault Tolerant Computing Symposium (FTCS-20), 236-243, Newcastle upon Tyne, U.K., 1990.

[4] Gaisler, J.: Evaluation of a 32-bit Microprocessor with Built-in Concurrent Error Detection. Proc. of the 27th Fault Tolerant Computing Symposium (FTCS-27), 42-46, Seattle, Washington, 1997.

[5] Ohlsson, J., Rimén, M., Gunneflo, U.: A Study of the Effects of Transient Fault Injection into a 32-bit RISC with Built-in Watchdog. Proc. of the 22nd Fault Tolerant Computing Symposium (FTCS-22), 316-325, Boston, Massachusetts, 1992.

[6] Mahmood, A., McCluskey, E.J.: Concurrent Error Detection Using Watchdog Processors - A Survey. IEEE Transactions on Computers, 37(2): 160-174, 1988.

[7] Galla, T.M., Sprachmann, M., Steininger, A., Temple, C.: Control Flow Monitoring for a Time-Triggered Communication Controller. Proceedings of the 10th European Workshop on Dependable Computing (EWDC-10), 43-48, Vienna, Austria, 1999.

[8] Gaisler, J.: Concurrent Error-Detection and Modular Fault-Tolerance in a 32-bit Processing Core for Embedded Space Flight Applications. Proc. of the 24th Fault Tolerant Computing Symposium (FTCS-24), 128-130, Austin, Texas, 1994.

[9] Kim, S., Somani, A.K.: On-Line Integrity Monitoring of Microprocessor Control Logic. Proc. Intl. Conference on Computer Design: VLSI in Computers and Processors (ICCD-01), 314-319, Austin, Texas, 2001.

[10] IEEE Std. 1076-1993: VHDL Language Reference Manual. The Institute of Electrical and Electronics Engineers Inc., New York, 1995.

[11] MIPS32 Architecture for Programmers, volume I: Introduction to the MIPS32 Architecture. MIPS Technologies, 2001.

[12] AMBA Specification rev2.0. ARM Limited, 1999.

[13] Rodríguez, F., Campelo, J.C., Serrano, J.J.: A Watchdog Processor Architecture with Minimal Performance Overhead. To be presented at SAFECOMP'2002, Catania, Italy, September 2002.

[14] Wildner, U.: Experimental Evaluation of Assigned Signature Checking With Return Address Hashing on Different Platforms. Proc. of the 6th Intl. Working Conference on Dependable Computing for Critical Applications, 1-16, Grainau, Germany, 1997.


Improving the Interleaved Signature Instruction Stream Technique


Dept. Informática de Sistemas y Computadores, Universidad Politécnica de Valencia

C/ Camino de Vera S/N, 46022 - Valencia, SPAIN

{prodrig, jcampelo, jserrano}@disca.upv.es, http://www.disca.upv.es/gstf

Abstract

Control flow monitoring using a watchdog processor is a well-known technique to increase the dependability of a microprocessor system. Most approaches embed reference signatures for the watchdog processor into the processor instruction stream, creating noticeable memory and performance overheads. A novel signature embedding technique called Interleaved Signature Instruction Stream (ISIS) has been recently presented. Targeted at processors included in field-programmable devices, its main goal is to reduce the performance penalty induced by the watchdog processor in previous proposals. The work presented here improves the ISIS technique and offers a solution to the memory overhead without sacrificing performance, thus yielding a better overall architecture we have called OSIRIS: Another Interleaved Signature Instruction Stream.

Keywords: Error detection, Embedded signature monitoring, Reliability, Fault-tolerance, Microprocessors

1. Introduction

In the "Model for the Future" foreseen by Avizienis in [1], the urgent need to incorporate dependability into everyday computing is clear: "Yet, it is alarming to observe that the explosive growth of complexity, speed, and performance of single-chip processors has not been paralleled by the inclusion of more on-chip error detection and recovery features".

Efficient error detection is of fundamental importance in dependable computing systems. As the vast majority of faults are transient, the use of a concurrent Error Detection Mechanism (EDM) is of utmost interest, as high coverage and low detection latency are needed to recover the system from the error. And as experiments demonstrate [2, 3, 4, 5], a high percentage of non-overwritten errors result in control flow errors.

Siewiorek states in [6] that "To succeed in the commodity market, fault-tolerant techniques need to be sought which will be transparent to end users". A fault-tolerant technique can be considered transparent only if it results in minimal overhead in silicon, memory size or processor speed.

Although redundant systems can achieve the best degree of fault-tolerance, the high overheads implied limit their applicability in everyday computing elements.

The work presented here provides concurrent detection of control flow errors with minimal impact on system performance, memory consumption and silicon size. No modifications to the architecture or the instruction set of the processor used as testbed are required in order to add a new instruction for the watchdog processor. The error detection capabilities can be enabled and disabled under software control to allow binary compatibility with existing software. The watchdog processor is very simple, and its design can be applied to other processors as well.

This work is derived from the Interleaved Signature Instruction Stream (ISIS) to improve its memory overhead.

The paper is structured as follows: the next section introduces some basic terms in the field of watchdog processors, and it is followed by an outline of the ISIS technique. Section 4 presents the OSIRIS technique and the resulting system architecture in which the watchdog is embedded. Performance and memory overhead are then discussed in comparison with similar work, and the paper finishes with the conclusions.

2. Introduction to watchdog processors

A minimal set of basic terms taken from [5] is needed to understand the overall system. A branch-in instruction is an instruction used as the target address of a branch or call instruction (for example, the first instruction of a procedure or function). A branch-out instruction is an instruction capable of breaking the sequential execution flow, conditionally or unconditionally (for example, a conditional branch or a procedure call instruction). A basic block is a sequence of instructions with no branch-in instructions except the very first one and no branch-out instructions except possibly the last one.
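The definition of a basic block translates directly into a partitioning procedure. The sketch below is illustrative only: instructions are identified by their index, the sets of branch-in and branch-out positions are assumed to be known, and delay slots are ignored for simplicity.

```python
def basic_blocks(n_instructions, branch_in, branch_out):
    """Partition instruction indices 0..n-1 into basic blocks.

    branch_in:  indices that are the target of some branch or call.
    branch_out: indices of instructions that may break sequential flow.
    """
    blocks, current = [], []
    for index in range(n_instructions):
        # A branch-in instruction may only be the first one of a block.
        if current and index in branch_in:
            blocks.append(current)
            current = []
        current.append(index)
        # A branch-out instruction may only be the last one of a block.
        if index in branch_out:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


print(basic_blocks(6, branch_in={3}, branch_out={1, 4}))
```

In the example, instruction 2 forms a block of its own because the branch at index 1 ends the previous block and index 3 is a branch target.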

A derived signature is a value assigned to each instruction block. The term derived means the signature is not an arbitrarily assigned value but is calculated from the block's instructions. Derived signatures are usually obtained by XORing the instruction opcodes or by using such opcodes to feed a Linear Feedback Shift Register (LFSR). These values are calculated at compile time and used as reference by the EDM to verify the correctness of executed instructions.
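Both ways of deriving a signature can be sketched in a few lines. The 32-bit width and the feedback polynomial (the CRC-32 taps) are assumptions chosen only for illustration; the text does not state which polynomial or width the watchdog uses.

```python
from functools import reduce

MASK32 = 0xFFFFFFFF
POLY = 0x04C11DB7  # assumption: CRC-32 feedback taps, for illustration only


def xor_signature(words):
    """Signature obtained by XORing all instruction words of a block."""
    return reduce(lambda acc, word: acc ^ word, words, 0) & MASK32


def lfsr_signature(words, poly=POLY):
    """Signature obtained by shifting every instruction bit into an LFSR."""
    state = 0
    for word in words:
        for bit in range(31, -1, -1):          # most significant bit first
            feedback = ((state >> 31) ^ (word >> bit)) & 1
            state = (state << 1) & MASK32
            if feedback:
                state ^= poly
    return state


block = [0x012A4020, 0x03E00008]    # two example MIPS instruction words
swapped = list(reversed(block))

print(xor_signature(block) == xor_signature(swapped))    # order is invisible
print(lfsr_signature(block) == lfsr_signature(swapped))  # order is visible
```

The comparison shows the usual trade-off: the XOR signature is cheaper but blind to reordered instructions, whereas the LFSR signature depends on the order in which the words are fed.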

If signatures are interspersed or hashed with the processor instructions, the method is generally known as Embedded Signature Monitoring (ESM). A watchdog processor is a hardware EDM used to detect Control Flow Errors (CFE) and/or corruption of the instructions executed by the processor, usually employing derived signatures and an ESM technique. In this case it performs signature calculations from the instruction opcodes that are actually executed by the main processor, checking these run-time values against their references. If any difference is found, the error in the main processor instruction stream is detected and an Error Recovery Mechanism (ERM) is activated.

3. The HORUS processor

In [7], a novel technique to embed signatures into the processor's instruction stream is presented. Its main goal is the reduction of the performance impact of the watchdog processor, and it is targeted at processors included in field-programmable devices.

This technique, called ISIS (Interleaved Signature Instruction Stream), hashes the watchdog processor's signatures and the application processor's instructions in the same memory area. Signatures are interleaved with the instruction basic blocks, but these signature instructions are never fetched or executed by the main processor.

3.1. System Architecture

The ISIS technique has been implemented in the HORUS processor [8], a soft-core clone of the MIPS R3000 [9] RISC processor (see Fig. 1). It is a four-stage pipelined processor with a complete Memory Management Unit and instruction cache. The external memory and peripherals are accessed through an AMBA AHB bus [10].

The original processor architecture has been augmented with a watchdog processor. The instruction cache is modified to include two read ports to provide simultaneous access to both processors (main and watchdog).

Figure 1: The initial HORUS architecture.

The watchdog processor (see Fig. 2) receives the main processor instructions as they are retired from the pipeline, performing the run-time calculations at the same rate the instructions are retired. When a basic block is finished, the run-time values are stored in a FIFO memory. The checking process between reference signatures and run-time values reads from this FIFO and the instruction cache to perform the match.

Figure 2: Modified architecture to include the watchdog processor using ISIS.

In Fig. 3, basic blocks for a conditional if-then-else statement are shown. Instruction addresses are shown in square brackets and lines indicate legal paths between blocks. Continuous lines are used to differentiate paths taken by means of an explicit jump from paths between blocks implicitly followed by the processor simply because instructions are executed in sequential order. These implicit paths are marked with dashed lines.

For example, the if-block ends with a conditional branch instruction targeted at instruction j. If the condition is met, the branch is taken and the execution flow is explicitly changed to address j. If the condition is not met, the branch is not taken and execution continues with the next instruction (at address i + 1). However, this sequential flow has implicitly moved the execution flow from the if-block to the then-block.

Figure 3: Basic blocks for a conditional if-then-else statement.

The main processor architecture has been modified to automatically skip watchdog signatures when possible. When the processor executes a conditional branch, the instruction that follows is automatically skipped even if the branch is not taken. The same applies when a procedure call is executed and the return address has to be stored.

These architectural modifications create word gaps in the main processor instruction stream immediately following branches and calls. A specialized compiler uses precisely these gaps to store the watchdog signatures.

In Fig. 4, the basic blocks of the same conditional statement are shown after the watchdog signatures have been added following the ISIS technique. In this figure, for example, the processor jumps over the watchdog signature at address i + 1 because the processor executes a conditional branch instruction at address i.

If the condition is met, the next instruction to be executed is stored at address j; if not, the processor skips the word after the branch and executes the instruction at address i + 1.

Not all signatures can be jumped over, however. Continuing with the example shown in Fig. 4, the signature at address k + 2 cannot be automatically skipped. This is because the processor has no hint to determine that the else-block is finishing when the instruction at address k is executed. In this case, the specialized compiler inserts an unconditional jump at address k + 1 to skip the signature, targeting the instruction at address k + 2. This jump is required because the signature instructions are encoded in such a way that the main processor would consider them illegal instructions, raising the corresponding exception.

Figure 4: Basic blocks and interleaved signatures for the conditional statement.

3.2. Performance and memory overhead

In [11] some performance results from the HORUS processor are presented. These results are summarized in Table 1 and clearly show that the goal of minimizing the performance penalty when the watchdog processor is in use has been achieved.

The test programs (fft, hanoi, quicksort and so on) are classic problems used as benchmarks, as they reflect different program types: sequential, iterative and recursive. In all cases the penalty is 6 % or below, and strongly recursive programs are affected by a negligible 0.5 % performance loss.

Table 1: Performance results of the HORUS processor.

            CPU cycles
Program     Without watchdog  With watchdog  Overhead
Fft                      892            940    5.38 %
Hanoi                   2717           2732    0.55 %
Quicksort                510            520    1.96 %
Matrix                   330            350    6.06 %
Queens                   925            967    4.54 %
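The overhead column of Table 1 follows from the two cycle counts as a relative increase. A small sketch of that computation, using the figures from the table (the function and dictionary names are hypothetical):

```python
def cycle_overhead(without_watchdog, with_watchdog):
    """Relative increase in CPU cycles, in percent."""
    return 100.0 * (with_watchdog - without_watchdog) / without_watchdog


# (cycles without watchdog, cycles with watchdog), as listed in Table 1
cycles = {
    "Fft": (892, 940),
    "Hanoi": (2717, 2732),
    "Quicksort": (510, 520),
    "Matrix": (330, 350),
    "Queens": (925, 967),
}

overheads = {name: round(cycle_overhead(*pair), 2) for name, pair in cycles.items()}
for name, value in overheads.items():
    print(f"{name}: {value} %")
```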

Although the performance overhead is minimal, memory consumption suffers from the fact that a signature word must be inserted for each basic block in the original program, and sometimes an explicit jump instruction must also be inserted to ensure the main processor skips the watchdog signatures.


Memory overhead results for the ISIS technique applied to the HORUS processor for the test programs above were presented in [8] and are summarized in Table 2.

It can be easily observed that the resulting memory penalty is not the strong point of the ISIS technique, as the mean overhead is 24.14 % and can reach 28 % for some test cases. Although quite large, these numbers are roughly the same as those reported for other watchdog processors previously presented in the literature [5, 12].

4. Improving the Memory Overhead: OSIRIS

A new embedding technique is proposed here. It is a mixture of classical approaches like those in [5, 13] and ISIS. It aims to reduce memory overhead without sacrificing performance.

Classical watchdog processor approaches embed signatures in the delay slot of branch instructions. The delay slot is the instruction that immediately follows a branch, and in a RISC processor it is always executed, even if the branch is taken. That is, it is stored immediately after the branch instruction but it logically belongs to the same basic block the branch belongs to. Modern compilers reorder instructions to select an appropriate instruction just before the branch in order to fill the delay slot. If no instruction can be selected, the slot is filled with a no-operation instruction. Instead of performing instruction reordering, the delay slot is always filled with the block signature if a watchdog processor is used.

The memory overhead results of those solutions are similar to ISIS. They are, however, easier to implement, as signature checking is executed like any other arithmetic operation of the Execution Unit (EU). That is, the instruction used to trigger the signature checking process is like any other instruction.

This is the weakest point of those approaches. Signatures must be processed by the main processor like any instruction, so the performance loss is larger than the 6 % of ISIS.

To summarize, the watchdog processor should be a separate unit, not included in the processor execution unit, to minimize performance loss. At the same time, creating a completely independent stream of signature instructions, as proposed with the ISIS technique, or using dedicated instructions, as in the classic approaches, results in a large memory overhead.

To solve the memory overhead and keep the watchdog processor separate from the execution unit of the main processor, we have developed another embedding technique. We have named this new technique Another Interleaved Signature Instruction Stream (OSIRIS).

The main idea behind this technique is to take advantage of unused fields of the instructions of a basic block to store the signature of the block. Only when there is not enough room in these unused fields is a new instruction inserted. Even in this last case, the instruction is a no-operation instruction for the main processor, and its unused fields will store the remaining signature bits.

Table 2: Memory consumption of the watchdog signatures in the HORUS processor.

Program     Original instr.  Inserted instr.  Signatures added  Overhead
Fft                     281               11                33   15.66 %
Hanoi                   118                0                23   19.50 %
Quicksort               844               36               203   28.32 %
Matrix                  139                5                27   23.02 %
Queens                  305               11                59   22.95 %

The MIPS instruction set, like most RISC processor instruction sets, is designed with the goal of achieving simplicity in the decoding process. This, combined with the fact that all instructions have the same length (32 bits), tends to create instructions with some unused fields.

Figure 5: MIPS instruction formats.

The instruction formats of the HORUS processor are shown in Fig. 5, where rs, rt and rd are register numbers. Of the three instruction formats, the R format (the most used one) has between 5 and 15 unused bits, depending on the exact instruction.
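The standard MIPS R-format layout (6-bit opcode, three 5-bit register fields, a 5-bit shift amount and a 6-bit function code) makes the unused fields easy to see. The sketch below decodes two well-known encodings; the function name is hypothetical, and which fields count as unused for each instruction is stated here as an observation on the classic MIPS encoding rather than as the authors' exact accounting.

```python
def decode_r_format(word):
    """Split a 32-bit MIPS R-format instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


add_fields = decode_r_format(0x012A4020)  # add $t0, $t1, $t2
jr_fields = decode_r_format(0x03E00008)   # jr $ra

print(add_fields)  # only shamt carries no information: 5 spare bits
print(jr_fields)   # rt, rd and shamt carry no information: 15 spare bits
```

These two cases correspond to the extremes of the 5 to 15 bit range mentioned above.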

This fact has been used in the MIPS16 architecture to create an instruction set with the specific goal of reducing the length of every instruction from 32 to 16 bits. We propose to fill these unused bits with the block signature information.

An analysis of the instructions used by the test programs mentioned in the previous section gives a mean of 5.8 unused bits per instruction. Taking into account that the mean length of a basic block in a RISC processor lies between 7 and 8 instructions [14], the mean number of available bits in a block is more than enough to store its signature.

There are always small blocks that do not provide enough bits. In this case, a nop instruction is inserted at the end of the block. The unused bits of this instruction will provide enough room for the remaining bits of the block signature.
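The packing idea can be sketched as distributing the signature bits over the spare bits of each instruction and appending a nop only when the block runs out of room. The signature width (16 bits), the number of spare bits in a nop (26) and the function names are hypothetical values chosen for the example; the text does not specify them.

```python
def pack_signature(signature, sig_bits, unused_bits, nop_bits):
    """Distribute a signature, MSB first, over the spare bits of a block.

    unused_bits[i] is the number of spare bits in the i-th instruction.
    Returns (chunks, nop_added); chunks[i] is a (value, width) pair.
    """
    capacity = list(unused_bits)
    nop_added = sum(capacity) < sig_bits
    if nop_added:
        capacity.append(nop_bits)      # extra nop at the end of the block
    if sum(capacity) < sig_bits:
        raise ValueError("signature does not fit even with an extra nop")
    chunks, remaining = [], sig_bits
    for width in capacity:
        take = min(width, remaining)
        remaining -= take
        chunks.append(((signature >> remaining) & ((1 << take) - 1), take))
    return chunks, nop_added


def unpack_signature(chunks):
    """Rebuild the signature from the chunks, as the watchdog would."""
    signature = 0
    for value, width in chunks:
        signature = (signature << width) | value
    return signature


chunks, nop = pack_signature(0xBEEF, 16, unused_bits=[5, 5, 5, 5], nop_bits=26)
small_chunks, small_nop = pack_signature(0xBEEF, 16, unused_bits=[5, 5], nop_bits=26)
print(nop, hex(unpack_signature(chunks)))
print(small_nop, hex(unpack_signature(small_chunks)))
```

A four-instruction block with 5 spare bits each holds the 16-bit value without help, while the two-instruction block needs the extra nop.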

4.1. System Architecture using OSIRIS

The resulting system architecture is now simpler, as shown in Fig. 6. The instruction cache now has a single read port, and the main instructions enter both the main processor pipeline (to be executed) and the reference matching part of the watchdog processor (to extract the block signature).

The watchdog processor is more complex, however, as it has to filter the main processor instructions to extract the block signature bits from unused fields of the executed instructions.

Figure 6: Processor architecture using the OSIRIS technique.

In order to maintain binary compatibility with existing software, the watchdog processor can be enabled and disabled under software control running with superuser privileges.

4.2. Performance and Memory Analysis

Although we have no experimental results yet to assess the performance or memory overhead of the OSIRIS technique, a preliminary analysis is possible.

As blocks larger than 6 instructions do not usually require an additional nop to store the block signature, there is no performance or memory penalty for those blocks.

For the other blocks, a single instruction is added, just like the proposal in [13], except that the latter requires this insertion for every block. As the ISIS technique requires a signature instruction per block, and a jump instruction for some blocks, its memory overhead must be larger than that of the OSIRIS technique proposed here.

With regard to performance, OSIRIS is expected to perform better than the classic approaches, as fewer instructions are inserted. In both cases, the added instructions must be processed by the main processor and consume CPU cycles.

The comparison between the performance overhead of OSIRIS and that of ISIS is not so obvious. The ISIS performance overhead is very small, and comes mainly from the instruction cache accesses. Only when performance results are available for the same test cases, modified to embed signatures following the OSIRIS technique, will the comparison be possible and fair.

5. Conclusion

We have presented a novel technique to embed signatures into the execution flow of a RISC processor that provides a set of error checking procedures to assess that the flow of executed instructions is correct.

All these checking mechanisms are performed on a per-block basis, in order to reduce the error detection latency of our hardware Error Detection Mechanism.

We have named our signature embedding technique Another Interleaved Signature Instruction Stream (OSIRIS) to reflect that this work is derived from the recently presented ISIS technique, whose goal is to minimize performance loss in processors included in field-programmable devices. The OSIRIS objective is to maintain performance while reducing the memory overhead, thus promoting the use of concurrent error detection mechanisms in new designs.

The main idea behind this technique is to take advantage of unused fields of the instructions of a basic block to store its signature. Only when there is not enough room in these unused fields is a new instruction inserted. Even in this case, the instruction is a no-operation instruction for the main processor, and its unused fields will store the remaining signature bits.

No modifications are required to the processor architecture or its instruction set. In order to maintain binary compatibility with existing software, the watchdog processor can be enabled and disabled under software control running with superuser privileges.

The watchdog processor is very simple, and its design can be applied to other processors as well. No specific features of the processor have been used, so porting OSIRIS to a different processor is quite straightforward. The memory overhead will vary depending on the number of unused bits in the processor instruction encoding.

Performance and memory overhead have been analyzed by comparison with other methods, although we have not yet performed a methodical study. As only a few instructions are added to the original program, the performance is expected to remain basically unaltered.

Bibliography

[1] Avizienis, A.: Building Dependable Systems:How to Keep Up with Complexity. Proc. ofthe 25th Fault Tolerant Computing Symposium(FTCS-25), 4-14, Pasadena, California, 1995.

[2] Gunne�o, U., Karlsson, J., Torin, J.: Evalua-tion of Error Detection Schemes Using FaultInjection by Heavy-ion Radiation. Proc. ofthe 19th Fault Tolerant Computing Symposium(FTCS-19), 340-347, Chicago, Illinois, 1989.





[7] Rodríguez, F., Campelo, J.C., Serrano, J.J.: A Watchdog Processor Architecture with Minimal Performance Overhead. To be presented at the 21st Safety and Reliability Conference (SAFECOMP'02), Catania (Italy), 2002.

[8] Rodríguez, F., Campelo, J.C., Serrano, J.J.: The HORUS Processor. To be presented at the XVII Conference on Design of Circuits and Integrated Systems (DCIS'2002), Santander, Spain, 2002.

[9] MIPS32 Architecture for Programmers, volume I: Introduction to the MIPS32 Architecture. MIPS Technologies, 2001.



[11] Rodríguez, F., Campelo, J.C., Serrano, J.J.: Delivering Error Detection Capabilities into a Field Programmable Device: The HORUS Processor Case Study. Submitted to the IEEE International Conference on Field-Programmable Technology (FPT'2002), Hong Kong, China, 2002.

[12] Wildner, U.: Experimental Evaluation of Assigned Signature Checking With Return Address Hashing on Different Platforms. Proc. of the 6th Intl. Working Conference on Dependable Computing for Critical Applications, 1-16, Grainau, Germany, 1997.


[14] Hennessy, J.L., Patterson, D.A.: Computer Architecture. A Quantitative Approach, 2nd edition, Morgan Kaufmann Pub., Inc., 1996.


Improving the Interleaved Signature Instruction Stream Technique


Dept. Informática de Sistemas y Computadores, Universidad Politécnica de Valencia

C/ Camino de Vera S/N, 46022 - Valencia, [email protected], [email protected], [email protected]

Abstract

Control flow monitoring using a watchdog processor is a well-known technique to increase the dependability of a microprocessor system. Most approaches embed reference signatures for the watchdog processor into the processor instruction stream, creating noticeable memory and performance overheads. A novel signature embedding technique called Interleaved Signature Instruction Stream (ISIS) has been recently presented. Targeted at processors included in field-programmable devices, its main goal is to reduce the performance penalty produced by the watchdog processor in previous proposals. The work presented here is an improvement of this technique and offers a solution to the memory overhead without sacrificing performance, thus yielding a better overall architecture. We have called this improved technique OSIRIS: Another Interleaved Signature Instruction Stream.

Keywords Concurrent error detection; embedded signature monitoring; fault-tolerance

1. Introduction

Efficient error detection is of fundamental importance in dependable computing systems. Although hardware redundancy can achieve the best degree of fault-tolerance, the high overheads implied limit its applicability in everyday computing elements. Since the vast majority of faults are transient, the use of a Concurrent Error Detection Mechanism (CEDM) is of utmost interest, as high coverage and low detection latency are needed to recover the system from the error. As experiments demonstrate [1, 2, 3, 4], a high percentage of non-overwritten errors results in control flow errors.

The work presented here provides a CEDM for control flow errors occurring in a general microprocessor, with minimal impact on system performance, memory consumption and silicon area. No modifications of the instruction set architecture are required, and the CEDM can be enabled and disabled under software control to allow complete binary compatibility with existing software. The watchdog processor is very simple, and its design can be applied to other RISC processors as well.

This work is derived from the Interleaved Signature Instruction Stream (ISIS) technique and arises from the need to improve its memory overhead, its only drawback compared with previous proposals.

The paper is structured as follows: the next section introduces some basic terms in the field of watchdog processors, and it is followed by an outline of the ISIS embedding technique and its implementation in a MIPS processor. Section 4 presents the OSIRIS technique and the resulting system architecture. Performance and memory overhead are discussed afterwards, and the paper finishes with the conclusions.

2. Watchdog processors

A minimal set of basic terms taken from [3] is needed to understand the overall system. A branch-in instruction is an instruction used as the target address of a branch or call instruction (for example, the first instruction of a procedure or function). A branch-out instruction is an instruction capable of breaking the sequential execution flow, conditionally or unconditionally. A basic block is a sequence of instructions with no branch-in instructions except the very first one and no branch-out instructions except possibly the last one.

A derived signature is a value assigned to each instruction block. The term derived means the signature is not an arbitrarily assigned value but is calculated from the block's instructions. Derived signatures are usually obtained by XORing the instruction opcodes or by using those opcodes to feed a Linear Feedback Shift Register (LFSR). These values are calculated at compile time and used as a reference by the CEDM to verify the correctness of executed instructions.
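As an illustration of the two derivation methods just mentioned, the following Python sketch computes an XOR signature and an LFSR-based signature over the 32-bit instruction words of a block. The function names, the 16-bit register width and the polynomial 0x1021 are assumptions chosen for the example; the text does not say which width or polynomial a particular watchdog uses.

```python
from functools import reduce

def xor_signature(words):
    """XOR all 32-bit instruction words of a block together."""
    return reduce(lambda acc, w: acc ^ (w & 0xFFFFFFFF), words, 0)

def lfsr_signature(words, poly=0x1021, width=16):
    """Shift every instruction bit (MSB first) into an LFSR.

    poly and width are illustrative values, not taken from the paper.
    """
    mask = (1 << width) - 1
    state = 0
    for w in words:
        for bit in range(31, -1, -1):
            in_bit = (w >> bit) & 1
            msb = (state >> (width - 1)) & 1
            state = (state << 1) & mask
            if msb ^ in_bit:
                state ^= poly
    return state

# Three MIPS instruction words forming a small block:
# lw $2,0($4); addu $2,$2,$3; jr $ra
block = [0x8C820000, 0x00431021, 0x03E00008]
reference_xor = xor_signature(block)      # computed "at compile time"
reference_lfsr = lfsr_signature(block)
```

At run time the watchdog would repeat the same calculation over the instructions actually executed and compare the result with the stored reference; in this sketch, flipping any single bit of the block changes both values.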

If signatures are interspersed or hashed with the processor instructions, the method is generally known as Embedded Signature Monitoring (ESM). A watchdog processor is a hardware CEDM used to detect Control Flow Errors (CFE) and/or corruption of the instructions executed by the processor, usually employing derived signatures and an ESM technique. If this is the case, it performs run-time signature calculations from the instruction opcodes that are actually executed by the main processor, checking these run-time values against their stored references. If any difference is found, the error in the main processor instruction stream is detected and an Error Recovery Mechanism (ERM) is activated.


3. System architecture with ISIS

In [5], a novel technique to embed signatures into the processor's instruction stream is presented. Its main goal is the reduction of the performance impact of the watchdog processor compared with previous proposals. This technique, called ISIS (Interleaved Signature Instruction Stream), hashes the watchdog processor signatures and the application processor's instructions in the same memory area. Signatures are interleaved within instruction basic blocks, but these signature words are never fetched or executed by the main processor, as described below.

Figure 1: Modified system architecture to include the watchdog processor using ISIS.

The ISIS technique has been implemented in the HORUS processor [6], a soft-core clone of the MIPS R3000 [7] RISC processor (see Fig. 1). It is a four-stage pipelined processor with a complete Memory Management Unit (MMU) and instruction cache. The external memory and peripherals are accessed through an AMBA AHB bus [8].

The original MIPS architecture is augmented with a watchdog processor. The instruction cache is modified to include two access ports, providing simultaneous access to both processors (the application processor uses port A and the watchdog uses port B).

The watchdog processor receives the main processor instructions as they are retired from the pipeline, performing the same signature calculations at run-time and at the rate at which the instructions are retired. When a basic block is finished, the run-time values are stored in a FIFO memory. The checking process between reference signatures and run-time values reads from this FIFO and the instruction cache to perform the match.

Figure 2: ISIS signature embedding example.

Signatures are interleaved with application instructions. The corresponding signature precedes every basic block of instructions. These signatures are automatically inserted into the program memory space by the compiler (see Fig. 2), a modified version of the GNU gcc compiler ported to the MIPS architecture. From the original application instructions (Fig. 2a), the compiler determines basic block signatures and placement, modifying the final executable to include them (Fig. 2b).

The meaning of a conditional branch when the condition is not met (that is, the actions taken by the processor when the branch is not taken) is modified so that the program flow jumps over the memory word after the branch. This is in contrast with the default meaning in a standard architecture: executing the instruction following the branch. This creates a gap between basic blocks when a conditional branch separates them. The compiler fills this gap with the block signature, which is not used by the main processor.

If a branch instruction does not separate the basic blocks, an unconditional branch instruction is required in order to create this gap; the compiler inserts it automatically as well.
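The placement rules above can be sketched as a small layout routine. This is a simplified model written for illustration only: it ignores MIPS branch delay slots, treats instructions as mnemonic strings, and the names `embed_isis`, `ends_in_branch` and the placeholder jump text are hypothetical, not part of the modified compiler.

```python
BRANCHES = {"beq", "bne", "j", "jal", "jr"}  # illustrative subset

def ends_in_branch(block):
    """True when the last instruction of the block is a branch or jump."""
    return block[-1].split()[0] in BRANCHES

def embed_isis(blocks):
    """Return the instrumented word sequence for a list of basic blocks.

    Each block is preceded by its signature word. When the previous block
    does not end in a branch, an unconditional jump is inserted first so
    that the main processor still skips the signature word.
    """
    out = []
    for index, block in enumerate(blocks):
        if index > 0 and not ends_in_branch(blocks[index - 1]):
            out.append("j <over signature>")
        out.append(f"SIG{index}")
        out.extend(block)
    return out

def memory_overhead(blocks):
    """Fraction of extra words added by the embedding."""
    original = sum(len(b) for b in blocks)
    return (len(embed_isis(blocks)) - original) / original

program = [["lw", "beq"], ["add", "sub"], ["jr"]]
layout = embed_isis(program)
```

With very short blocks, as in this toy program, the added signature and jump words dominate, which is the effect behind the memory overhead figures discussed for ISIS.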

In [6], performance and memory overhead results from the HORUS processor using the ISIS technique are presented; they are summarized in Table 1. These results clearly show that the goal of minimizing the performance penalty is achieved, but that memory consumption is not the strongest point of ISIS.

Table 1: Overhead results using ISIS.

Program     Memory overhead   Performance overhead
Fft         15.66 %           5.38 %
Hanoi       19.50 %           0.55 %
Quicksort   28.32 %           1.96 %
Matrix      23.02 %           6.06 %
Queens      22.95 %           4.54 %

In all cases, the performance penalty is about 6 % or less, and strongly recursive programs are affected by a negligible performance loss of about 0.5 %. On the other hand, memory consumption suffers from the fact that a signature word must be inserted for each basic block in the original program, and sometimes an explicit jump instruction must also be inserted to ensure the main processor jumps over signatures. This makes the memory overhead reach 28 % for some test cases. Although quite large, these numbers are roughly the same as the results offered by other watchdogs.

4. The OSIRIS approach

A new embedding technique is proposed here. It is a mixture of classical approaches like those in [3, 9] and ISIS, with the goal of reducing memory overhead without sacrificing performance.

Most ESM watchdog processors embed signatures in the delay slot of branch instructions. The delay slot is the instruction that immediately follows a branch, and it is always executed in a RISC processor, even if the branch is taken. That is, it is stored immediately after the branch instruction but it logically belongs to the same basic block the branch belongs to. Instead of performing instruction reordering, the delay slot is filled with the block signature if a watchdog processor is used.

The memory overhead results of those solutions are similar to ISIS. They are, however, easier to implement, as signature checking is executed like any other arithmetic operation of the Execution Unit (EU). That is, the instruction used to trigger the signature checking process is like any other instruction. And this is precisely their weakest point: as the main processor must execute instructions to check block signatures, performance is degraded.

To reduce memory overhead and at the same time separate the watchdog processor from the execution unit of the main processor, we have developed a new embedding technique called OSIRIS (Another Interleaved Signature Instruction Stream).

The main idea of this technique is to take advantage of unused fields of the application instructions to store the signature of the block. A new instruction is inserted only when there is not enough room in these unused fields. If this is the case, the instruction must be a no-operation instruction for the main processor, and its unused fields store the remaining signature bits.

The MIPS instruction set, like most RISC instruction sets, is designed with the goal of simplicity in the decoding process. This tends to generate instruction sets with some unused fields.

The instruction formats of the HORUS processor are shown in Fig. 3, where rs, rt and rd are 5-bit register numbers. Of the three instruction formats, the most used is the R format, which has between 5 and 15 unused bits, depending on the exact instruction. This fact has been used in the MIPS16 architecture to create an instruction set with the specific goal of reducing the length of instructions from 32 to 16 bits. We propose to fill these unused bits with the block signature information.

Figure 3: MIPS instruction formats.

Analyzing the instructions used by the test programs mentioned in the previous section, the mean number of unused bits is 5.8 bits per instruction. Taking into account that the mean length of a basic block in a RISC processor is between 7 and 8 instructions [10], the mean number of available bits in a block is more than enough to store its signature.
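The averages quoted above can be turned into a quick estimate. The sketch below assumes a 32-bit signature, which is an assumption made for the example since the text does not state the OSIRIS signature width, and it works with mean values only, so it says nothing about individual blocks.

```python
import math

MEAN_UNUSED_BITS = 5.8  # mean unused bits per instruction, from the text

def mean_available_bits(block_length):
    """Average number of spare bits offered by a block of this length."""
    return MEAN_UNUSED_BITS * block_length

def min_block_length(signature_bits):
    """Shortest block that, on average, holds a signature of this width."""
    return math.ceil(signature_bits / MEAN_UNUSED_BITS)

bits_in_7 = mean_available_bits(7)      # about 40.6 bits
bits_in_8 = mean_available_bits(8)      # about 46.4 bits
needed_length = min_block_length(32)    # 32-bit signature is hypothetical
```

Under these assumptions a block of mean length offers roughly 40 to 46 spare bits, comfortably above a one-word signature.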

There are always small blocks that do not provide enough bits. In this case, a nop instruction is inserted at the end of the block. The unused bits of this instruction provide enough room for the remaining bits of the block signature.
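A minimal sketch of how signature bits might be spread over the unused fields of a block, with a nop appended when the block is too small. The function names, the chunk representation and the `nop_bits` parameter (the number of spare bits a nop is taken to offer) are hypothetical; the actual field layout used by OSIRIS is not described here.

```python
def pack_signature(signature, sig_bits, unused_bits, nop_bits):
    """Distribute a signature over the spare bits of each instruction.

    unused_bits[i] is the number of spare bits in instruction i.
    Returns (chunks, nop_needed), where chunks[i] = (value, width) is the
    signature slice stored in instruction i (the nop comes last if added).
    """
    slots = list(unused_bits)
    nop_needed = sum(slots) < sig_bits
    if nop_needed:
        slots.append(nop_bits)  # one extra no-operation instruction
    if sum(slots) < sig_bits:
        raise ValueError("signature does not fit even with one nop")
    chunks, remaining = [], sig_bits
    for width in slots:
        take = min(width, remaining)
        remaining -= take
        chunks.append(((signature >> remaining) & ((1 << take) - 1), take))
    return chunks, nop_needed

def unpack_signature(chunks):
    """Watchdog side: rebuild the signature from the extracted slices."""
    value = 0
    for bits, width in chunks:
        value = (value << width) | bits
    return value

large_chunks, large_nop = pack_signature(0xBEEF, 16, [5, 0, 10, 5], nop_bits=26)
small_chunks, small_nop = pack_signature(0xBEEF, 16, [5, 5], nop_bits=26)
```

The first call fits the signature without extra instructions; the second needs the additional nop because the two instructions offer only 10 spare bits.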

4.1. System Architecture using OSIRIS

The resulting system architecture is now simpler, as can be seen in Fig. 4 (only modified elements are shown). The instruction cache now has a single access port, and the fetched instructions enter both the main processor pipeline (to be executed) and a new module of the watchdog processor (to extract the block signature).

The watchdog processor is more complex, however, as it has to filter the main processor instructions to extract the block signature bits from the unused fields of the executed instructions.

Although we have no experimental results yet to assess the performance or memory overhead of the OSIRIS technique, a preliminary analysis is possible.

As blocks larger than 6 instructions do not usually require an additional nop to store the block signature, there is no performance or memory penalty for them.

Figure 4: System architecture using OSIRIS.

For the other blocks, a single instruction is added, just like the proposal in [9], except that the latter requires this insertion for every block. As the ISIS technique requires a signature instruction per block, and a jump instruction for some blocks, its memory overhead must be larger than that of the OSIRIS technique proposed here.
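The comparison in this paragraph can be made concrete by counting added words under each scheme. The sketch assumes, for simplicity, that an OSIRIS block needs a nop exactly when it is shorter than a fixed length (7 here, following the remark about blocks larger than 6 instructions); in practice the decision would depend on the spare bits of the actual instructions. All function names are hypothetical.

```python
def isis_extra_words(ends_in_branch):
    """One signature per block, plus a jump when no branch ends the block."""
    return sum(1 if has_branch else 2 for has_branch in ends_in_branch)

def per_block_extra_words(block_lengths):
    """Delay-slot style proposals: one inserted word for every block."""
    return len(block_lengths)

def osiris_extra_words(block_lengths, min_len_without_nop=7):
    """One nop only for blocks too short to hold the signature."""
    return sum(1 for n in block_lengths if n < min_len_without_nop)

lengths = [3, 8, 7, 2, 10]                  # made-up block lengths
branch_flags = [True, False, True, True, False]
added = (isis_extra_words(branch_flags),
         per_block_extra_words(lengths),
         osiris_extra_words(lengths))
```

For this made-up program the counts are ordered as the text argues: ISIS adds the most words, the per-block schemes fewer, and OSIRIS the fewest.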

With regard to performance, OSIRIS is expected to perform better than previous approaches (apart from ISIS), as fewer instructions are inserted. Comparing the performance overhead of OSIRIS and ISIS is not so straightforward. The ISIS performance overhead is very small, and comes mainly from simultaneous accesses to the instruction cache. It is the cache access pattern that dictates performance loss when the ISIS technique is in use, and Table 1 shows no direct relationship between memory and performance overheads. Only when performance results with the same test cases are available will the comparison be possible and fair.

5. Conclusion

We have presented a novel technique to embed signatures into the execution flow of a RISC processor that provides a set of error checking procedures to assess that the flow of executed instructions is correct. We have called it Another Interleaved Signature Instruction Stream (OSIRIS) to reflect that this work is derived from the recently presented ISIS technique, whose goal is to minimize performance loss in processors included in field-programmable devices. The main point of OSIRIS is to maintain performance and reduce memory overhead, thus promoting the use of CEDMs in new designs.

We take advantage of unused fields of the instructions of a basic block to store the block's signature. A new instruction is inserted only if there is not enough room in those unused fields. In order to maintain binary compatibility with existing software, the watchdog processor can be enabled and disabled under software control.

The watchdog processor is very simple, and its design can be applied to other processors as well. No specific features of the processor have been used, so porting OSIRIS to a different processor is quite straightforward. The memory overhead will vary depending on the number of unused bits in the processor instruction encoding.

Performance and memory overhead have been analyzed by comparison with other methods, although we have not performed a methodical study yet.

Acknowledgements

This work is partially supported by the Spanish Government project CICYT TAP99-0443-C05-02 and the Valencian Community project CTIDIA/2002/27.

Bibliography




[4] Wildner, U.: Experimental Evaluation of Assigned Signature Checking With Return Address Hashing on Different Platforms. Proc. of the 6th Intl. Working Conference on Dependable Computing for Critical Applications, 1-16, Grainau, Germany, 1997.

[5] Rodríguez, F., Campelo, J.C., Serrano, J.J.: A Watchdog Processor Architecture with Minimal Performance Overhead. Proc. of the 21st Safety and Reliability Conference (SAFECOMP'02), Catania (Italy), Sept. 2002.

[6] Rodríguez, F., Campelo, J.C., Serrano, J.J.: Delivering Error Detection Capabilities into a Field Programmable Device: The HORUS Processor Case Study. Proc. of the IEEE International Conference on Field-Programmable Technology (FPT'2002), pp. 418-422, Hong Kong, China, 2002.

[7] MIPS32 Architecture for Programmers, volume I: Introduction to the MIPS32 Architecture. MIPS Technologies, 2001.



[10] Hennessy, J.L., Patterson, D.A.: Computer Architecture. A Quantitative Approach, 2nd edition, Morgan Kaufmann Pub., Inc., 1996.


Control Flow Error Checking with ISIS

F. Rodríguez, J. J. Serrano

Grupo de Sistemas Tolerantes a Fallos - Fault Tolerant Systems Group, Polytechnical University of Valencia, 46022, Valencia, Spain

{prodrig, jserrano}@disca.upv.es
http://www.disca.upv.es/gstf

Abstract

The Interleaved Signature Instruction Stream (ISIS) is a signature embedding technique that allows signatures to co-exist with the main processor instruction stream with a minimal impact on processor performance, without sacrificing error detection coverage or latency.

While ISIS incorporates some novel error detection mechanisms to assess the integrity of the program executed by the main processor, the limited number of bits available in the signature control word raises the question of whether these mechanisms are effective at detecting errors in the program execution flow. Increasing the signature size would negatively impact the memory requirements, so this option has been rejected. The effectiveness of such mechanisms is therefore an issue that must be addressed. This paper details the checking mechanisms included within the ISIS technique that are responsible for assessing the integrity of the processor execution flow, and the experiments carried out to characterize their coverage.

1. Introduction

With the advent of modern technologies in the field of programmable devices and enormous advances in the software tools used to model, simulate and translate into hardware almost any complex digital system, the capability to design a System-On-Chip (SoC) has become a reality even for small companies. With the widespread use of embedded systems in our everyday life, service availability and dependability concerns for these systems are increasingly important [1].

A SoC is usually modeled using a Hardware Description Language (HDL) like VHDL [2]. It allows a hierarchical description of the system, and the designed elements are interconnected in much the same way as they would be in a graphical design flow, but using an arbitrary abstraction level. It also provides I/O facilities to easily incorporate test vectors, and language assertions to verify the correct behavior of the model during simulation.

Efficient error detection is of fundamental importance in dependable computing systems. As the vast majority of faults are transient, the use of a concurrent Error Detection Mechanism (EDM) is of utmost interest, as high coverage and low detection latency are needed to recover the system from the error. And as experiments demonstrate [3, 4, 5], a high percentage of non-overwritten errors results in control flow errors.

The possibility of modifying the original architecture of a processor modeled in VHDL gives the SoC designer an unprecedented capability to incorporate EDMs that were previously available only to large design companies.

Siewiorek states in [6] that "To succeed in the commodity market, fault-tolerant techniques need to be sought which will be transparent to end users". A fault-tolerant technique can be considered transparent only if it results in minimal overhead in silicon area, memory size or processor speed. Although redundant systems can achieve the best degree of fault-tolerance, the high overheads imposed limit their applicability in everyday computing elements. The same limitation applies when a software-only solution is used, due to performance losses. Siewiorek's statement can also be translated into the SoC world, to demand fault-tolerant techniques that minimize their impact on performance (the scarcest resource in such systems) if those techniques are to be used at all.

The work presented here is structured as follows: the next section is devoted to a minimal background on concurrent EDMs, specifically those using watchdog processors. A section on previous work follows, where the ISIS watchdog technique and its implementation in a SoC are described. The software support for this system is also outlined in this section.

The following section reports how the EDMs associated with the execution flow guarantee its integrity; these mechanisms are characterized either theoretically or by means of experiments. For those requiring experiments, the memory model is described in the corresponding subsection, along with the results obtained. The paper ends with the conclusions and some further research opportunities.

2. Background

A minimal set of basic terms taken from [5] is needed to understand the overall system. A branch-in instruction is an instruction used as the target address of a branch or call instruction (for example, the first instruction of a procedure or function). A branch-out instruction is an instruction capable of breaking the sequential execution flow, conditionally or unconditionally (for example, a conditional branch or a procedure call instruction). A basic block is a sequence of instructions with no branch-in instructions except the very first one and no branch-out instructions except possibly the last one.

A derived signature is a value assigned to each instruction block to be used as a reference in the checking process at run-time. The term derived means the signature is not an arbitrarily assigned value but is calculated from the block's instructions. Derived signatures are usually obtained by XORing the instruction opcodes or by using the opcodes to feed a Linear Feedback Shift Register (LFSR). These values are calculated at compile time and used as a reference by the EDM to verify the correctness of executed instructions.

If signatures are interspersed or hashed with the processor instructions, the method is generally known as Embedded Signature Monitoring (ESM). A watchdog processor is a hardware EDM used to detect Control Flow Errors (CFE) and/or corruption of the instructions executed by the processor, usually employing derived signatures and an ESM technique. In this case it performs signature calculations from the instruction opcodes that are actually executed by the main processor, checking these run-time values against their references. If any difference is found, the error in the main processor instruction stream is detected and an Error Recovery Mechanism (ERM) is activated.

The percentage of detected errors is the error detection coverage, and the time from the error becoming active to its detection is the error detection latency. With both values any EDM can be characterized.

A branch insertion error is the error produced when the opcode of a non-branch instruction is corrupted and it is transformed into a branch instruction; from a watchdog processor perspective, this error is detected as a too early branch. A branch deletion error is the error produced when the opcode of a branch instruction gets corrupted and the instruction becomes a non-branch instruction; the watchdog detects this error condition as a too late branch.

Errors affecting a non-branch instruction, other than branch insertion errors, do not affect the execution flow of the program and are not covered by the structural integrity checking mechanisms.

3. Previous Work

In [7], a novel technique to embed signatures into the processor's instruction stream is presented. Its main goal is the reduction of the performance impact of the watchdog processor, and it is targeted at processors included in embedded systems.

Using this technique, called ISIS (Interleaved Signature Instruction Stream), the watchdog processor signatures are hashed with the application processor's instructions in the same memory area. Signatures are interleaved within instruction basic blocks, but they are never fetched or executed by the main processor.

Signature control words (or simply signatures) are placed at the beginning of every basic block in the ISIS scheme (see Fig. 1). These references incorporate, among other checking mechanisms, the opcode signature field: a polynomial CRC of the block instruction bits to detect the corruption of any instruction (non-CFE errors as well as branch insertion and deletion errors). Using a polynomial redundancy check, 100% of single bit errors and a large percentage of more complex error scenarios can be detected.

Figure 1: ISIS signature control word and signature insertion process: (a) high-level language snippet, (b) original blocks at assembly stage, and (c) after code is instrumented with signatures

Besides the error detection capabilities obtained from the opcode signature, and because the block reference word includes the block length, branch insertion and branch deletion errors are detected.

The signature word encoding has been designed in such a way that a main processor instruction cannot be misinterpreted as a watchdog signature instruction. This provides an additional check when the main processor executes a branch instruction. This check, called Block Start, consists of the requirement to find a signature instruction immediately preceding the first instruction of every block. This also helps to detect a CFE if a branch erroneously targets a signature instruction, because the encoding will force an illegal instruction exception to be raised.

Under the assumption of single bit errors, the block length allows the watchdog processor to detect all branch insertion and branch deletion errors. Additional checking mechanisms related to the signature word instruction type and jump address guard bits are also included.
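The role of the block length can be sketched as follows. This is a simplified model for illustration: it ignores branch delay slots and assumes the reference also records whether the block is expected to end in a branch (a hypothetical flag; the text only mentions an instruction type field without detailing it).

```python
def check_block_flow(expected_length, ends_with_branch, executed_is_branch):
    """Classify the flow of one block from the retired instruction stream.

    executed_is_branch[i] tells whether the i-th retired instruction of the
    block was decoded as a branch by the main processor.
    """
    for position, is_branch in enumerate(executed_is_branch, start=1):
        if is_branch and (position < expected_length or not ends_with_branch):
            return "branch insertion"   # seen as a too early branch
        if position == expected_length:
            if ends_with_branch and not is_branch:
                return "branch deletion"  # seen as a too late branch
            return "ok"
    return "incomplete block"

correct = check_block_flow(3, True, [False, False, True])
inserted = check_block_flow(3, True, [False, True, False])
deleted = check_block_flow(3, True, [False, False, False])
```

A corrupted opcode that turns the second instruction into a branch ends the block before the recorded length is reached, and a corrupted final branch lets execution run past it; both cases disagree with the stored length.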

The Block Address check uses one of the address check fields in the signature word (Block Origin Address or Block Target Address) to verify the correctness of the address of the target instruction when a branch is taken. The difference between the addresses of the branch instruction and the target instruction is computed at compile time, and a checksum is calculated and stored in the signature word. At run-time, when the processor breaks the execution sequence by taking a branch, the actual addresses employed by the processor are used inside the watchdog processor, following the same algorithm used by the compiler, to calculate another checksum. In the absence of errors, both must match; any mismatch will trigger the watchdog's error detection procedure.
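A minimal sketch of this check, assuming a checksum obtained by XOR-folding the 32-bit address difference into 4 bits. Both the folding scheme and the 4-bit width are assumptions chosen for the example; the text does not specify the checksum algorithm or the width of the address check fields.

```python
def fold_checksum(value, bits=4):
    """XOR-fold a 32-bit value into `bits` bits (illustrative algorithm)."""
    value &= 0xFFFFFFFF
    mask = (1 << bits) - 1
    checksum = 0
    while value:
        checksum ^= value & mask
        value >>= bits
    return checksum

def reference_checksum(branch_addr, target_addr, bits=4):
    """Compile-time side: checksum of the branch-to-target distance."""
    return fold_checksum(target_addr - branch_addr, bits)

def block_address_check(reference, branch_addr, actual_target, bits=4):
    """Run-time side: recompute with the addresses the processor used."""
    return fold_checksum(actual_target - branch_addr, bits) == reference

ref = reference_checksum(0x1010, 0x1040)            # stored in the signature
good = block_address_check(ref, 0x1010, 0x1040)     # correct branch
bad = block_address_check(ref, 0x1010, 0x1044)      # wrong target, detected
alias = block_address_check(ref, 0x1010, 0x1310)    # wrong target, same checksum
```

The last call shows that a short checksum can alias: a wrong target may produce the same value as the correct one, so the coverage of such a check cannot be taken for granted.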

These two EDMs, Block Start and Block Address, form the basic elements used by the watchdog processor to guarantee the integrity of the processor's execution flow. The work presented in this paper shows their error coverage characteristics, both separately and combined.

To reduce performance overhead, the main CPU should not process signatures in any way. With this objective in mind, the CPU is designed to skip one instruction word per basic block while maintaining the normal instruction sequencing. These architectural modifications create word gaps in the main processor instruction stream immediately following branches and calls. A specialized compiler uses these gaps to store watchdog signature words.

With this arrangement, two completely independent interleaved instruction streams coexist in our system: the application instruction stream, which is divided into blocks and executed by the main processor, and the signature stream, used by the watchdog processor.

Isolating the reference signatures from the instructions fed into the processor pipeline results in a minimal performance overhead in the application program. More information about this signature embedding technique can be found in [7].

The ISIS technique has been implemented in the HORUS processor [8], a soft-core clone of the MIPS R3000 [9] RISC processor (see Fig. 2). It is a four-stage pipelined processor with an instruction cache and a complete Memory Management Unit (MMU), which performs virtual to physical address mapping, isolates memory areas of different processes and checks correct alignment of memory references. The external memory and peripherals are accessed through an AMBA AHB bus [10]. The watchdog processor is fed with the instructions from the main processor pipeline as they are retired.

The original processor architecture has been augmented with an ISIS watchdog processor. The instruction cache is modified to include two read ports to provide simultaneous access to both processors (main and watchdog processors).

The watchdog calculates run-time signatures at the same rate as the processor pipeline. When a block ends, these values are stored in a FIFO memory to decouple the checking process. This FIFO allows a large set of instructions to be retired from the pipeline while the watchdog is waiting for the block reference signature word. In a similar way, the watchdog can empty the FIFO while the main processor pipeline is stalled due to a memory operation. When this FIFO memory is full, the main processor is forced to wait for the watchdog checking process to read some data from it.

Figure 2: HORUS processor and overall system architecture
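The decoupling role of the FIFO can be sketched with a small behavioural model. The class name, the method names and the use of a dictionary of reference signatures indexed by block address are assumptions made for the example; they do not describe the actual hardware interface.

```python
from collections import deque

class WatchdogFifo:
    """Behavioural model of the FIFO between signature calculation and checking."""

    def __init__(self, depth):
        self.depth = depth
        self.fifo = deque()

    def block_retired(self, block_addr, runtime_signature):
        """Called when a block ends. Returns False if the pipeline must stall."""
        if len(self.fifo) >= self.depth:
            return False            # FIFO full: main processor has to wait
        self.fifo.append((block_addr, runtime_signature))
        return True

    def check_next(self, references):
        """Check one queued block. Returns None when there is nothing to check."""
        if not self.fifo:
            return None
        block_addr, runtime_signature = self.fifo.popleft()
        return references[block_addr] == runtime_signature

references = {0x1000: 0xAB, 0x1010: 0xCD, 0x1020: 0xEF}
watchdog = WatchdogFifo(depth=2)
accepted = [watchdog.block_retired(0x1000, 0xAB),
            watchdog.block_retired(0x1010, 0xCC),   # corrupted block
            watchdog.block_retired(0x1020, 0xEF)]   # FIFO already full
results = [watchdog.check_next(references) for _ in range(3)]
```

The third block is refused because the two-entry FIFO is full, which models the stall described above; the checker later reports the second block as a mismatch.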

4. HORUS Compiler Support

The GNU gcc compiler already provides a port to target MIPS processors. As its source code is freely available, it was the natural starting point to provide the required software support for the HORUS processor. The gas program (GNU Assembler) is responsible for the assembly stage of the compilation process, after the program optimization passes and before the final linker stage.

The gas program and its supporting libraries have been modified to support the architecture of HORUS and its use of the ISIS technique via command line switches. As instructions are assembled:

1. If the current instruction is the target of a branch instruction, a new block starts, so its signature is inserted.

2. If the current instruction is a branch, the next instruction will fill the branch delay slot and end the current block.

With this information and the opcode bits of the program instructions, the assembler can calculate block signatures and insert them at the appropriate places. No provisions are needed to modify the target of a branch or call instruction, as all instruction addresses are referenced using symbolic names (labels).

The software splits long sequences of instructions to fit the generated blocks to the length field of the signature control word. Reducing the number of instructions in a block increases the memory requirements, but it also reduces the latency from error activation to detection. While the length field allows for blocks of up to 16 instructions, the actual block length can be smaller for several reasons, most notably:

1. One of the instructions in the sequence is the target of a branch instruction. In this case, a signature must precede this instruction, so a new block must be created.

2. The use of variant frags. A variant frag is a combination of two different sequences of instructions generated by the assembler to solve the same task. For example, to store the address of a variable into a register, several sequences of instructions (with different lengths) are possible using the MIPS instruction set, depending on the availability of a register pointer. If the symbol cannot be resolved at assembly time, both sequences are generated. Obviously, only one of these remains in the final executable, but the decision is delayed until the symbol address is resolvable. As the block size must be determined at the time of instruction generation, the approach taken has been conservative: the assembler always assumes that the longer sequence will remain. By the time the symbol is resolved, the blocks are already formed and their size cannot be changed, so if the short sequence is finally selected the block will be shorter than 16 instructions.
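The block-forming rules of this section (a new block at every branch target, a block ending after the delay slot of a branch, and a maximum of 16 instructions) can be sketched as below. The input format, the `BRANCHES` set and the function name are hypothetical, and variant frags are not modelled; this is not the code of the modified gas.

```python
BRANCHES = {"beq", "bne", "j", "jal", "jr"}  # illustrative subset

def form_blocks(instructions, max_len=16):
    """Split a list of (is_branch_target, mnemonic) pairs into blocks."""
    blocks, current, in_delay_slot = [], [], False
    for is_target, mnemonic in instructions:
        is_branch = mnemonic in BRANCHES
        needed = 2 if is_branch else 1   # a branch also needs its delay slot
        if current and not in_delay_slot and (
                is_target or len(current) + needed > max_len):
            blocks.append(current)       # start a new block here
            current = []
        current.append(mnemonic)
        if in_delay_slot:                # delay slot ends the block
            blocks.append(current)
            current, in_delay_slot = [], False
        elif is_branch:
            in_delay_slot = True
    if current:
        blocks.append(current)
    return blocks

code = [(False, "lw"), (False, "beq"), (False, "nop"),
        (True, "add"), (False, "sub"),
        (True, "jr"), (False, "nop")]
blocks = form_blocks(code)
long_run = form_blocks([(False, "add")] * 20)
```

Each resulting block would then receive its own signature word; the straight-line run of 20 instructions is split only because of the length limit.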

5. Error Detection Coverage of CFEs

The Block Start EDM can be theoretically characterized, and its error coverage is 100% as stated in Proposition 1. The Block Address EDM requires some experiments to be carried out, as detailed below.

Proposition 1. The Block Start checking mechanism ensures that all CFEs targeting an instruction other than the first instruction of a block are detected.

Proof. A signature precedes the first instruction of a block. The watchdog processor uses the block initial address (whether correct or not) as a memory reference to get the block's signature, retrieving it from the memory location immediately preceding this initial address. Because the bit patterns of signature words are selected not to match any instruction of the main processor, no instruction of the main processor can be misinterpreted by the watchdog processor as a block signature.

So, in the case of a CFE targeting an instruction other than the first instruction of a block, the content of the immediately preceding memory location is a processor instruction and not a signature word. Its bit pattern will not match any signature type in the watchdog processor, and the mismatch will trigger the error detection.
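The argument can be illustrated with a small Python sketch of the Block Start check. The signature encoding used here (a reserved value in the top six bits of the word) is purely hypothetical, since the actual bit patterns are not given in this section; the only property relied on is that no processor instruction matches a signature pattern.

```python
SIG_OPCODE = 0x3F  # hypothetical marker in bits 31..26; not the real HORUS encoding


def is_signature(word):
    """True if the 32-bit word carries the (assumed) signature marker."""
    return ((word >> 26) & 0x3F) == SIG_OPCODE


def block_start_check(memory, target_addr):
    """Return True if the word just before target_addr is a signature word.

    memory: dict mapping word-aligned byte addresses to 32-bit words.
    A False result corresponds to a detected control flow error.
    """
    preceding = memory.get(target_addr - 4)
    return preceding is not None and is_signature(preceding)
```

With a signature stored at address 0x100 and instructions at 0x104 and 0x108, a branch to 0x104 passes the check, whereas a branch to 0x108 finds an ordinary instruction at 0x104 and is flagged.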

Run-time calculation errors inside the main processor are not CFEs unless the incorrect value is an instruction address. Taking a branch or returning from a procedure, where a target instruction address must be calculated or retrieved from memory, are examples of such calculations. The opcode signature cannot cover those calculations, as the original instruction is not corrupted.

The effectiveness of the Block Address checking mechanism coupled with the Block Start check can only be assessed experimentally.

5.1. Experiment Setup

To determine the error detection coverage of the EDMs applicable to CFEs, a simulation model of the address calculation process has been created. This model mimics the operations performed by the actual processor when executing branches. By injecting faults into the model, an erroneous target address is obtained and we are able to determine whether the EDMs would detect it.

The simulation model consists of a large array of elements representing the processor's memory. Each element represents a block of sequential instructions with its start address, length, signature, type of branch instruction, target address, etc. The type of branch instruction is important, as the address calculation process in the MIPS architecture is completely different depending on whether the instruction is a conditional branch or an unconditional jump. The former uses a program counter relative address and the latter an absolute address.

Injecting a fault into the address calculation process in this model is as simple as randomly picking the origin block and simulating the effect of a single bit error at the branch.

The fault masking probability is determined by comparing the new, erroneous target address with the original one. A fault is masked if the calculation produces the same result as it would without the fault.

The error detection probability of the Block Address EDM is obtained by using the erroneous address to compute the address guard bits and comparing them with the bits stored in the block signature word. The error detection probability of the Block Start EDM is obtained by searching the memory model to verify whether the erroneous address matches the start of a block.

To simulate the effect of a single bit error in the address calculation process, a single bit of one of the operands or a single bit of the result is altered. Both the value and the bit are chosen randomly. If an operand is modified, the bit is changed before the target address is calculated. If the result is modified, the bit is changed after the calculation is performed. Thus, a single bit error in the operands may propagate to adjacent bit positions, so the model covers both single and multiple bit errors in the result.
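The injection step can be sketched in Python as follows. This is a simplified illustration, not the simulation model used in the experiments: it covers only the program counter relative case (target = delay slot address + byte offset, as in MIPS conditional branches), the function names are hypothetical, and "error length" is taken to mean the number of bits that differ between the correct and the erroneous address, which is an assumption.

```python
import random

MASK32 = 0xFFFFFFFF


def branch_target(delay_slot_addr, offset_bytes):
    """PC-relative target: delay slot address plus byte offset, modulo 2**32."""
    return (delay_slot_addr + offset_bytes) & MASK32


def inject(delay_slot_addr, offset_bytes, where, bit):
    """Return the erroneous target after flipping a single bit.

    where: "base" or "offset" flips an operand before the addition;
           "result" flips the computed address after the addition.
    """
    flip = 1 << bit
    if where == "base":
        return branch_target(delay_slot_addr ^ flip, offset_bytes)
    if where == "offset":
        return branch_target(delay_slot_addr, (offset_bytes ^ flip) & MASK32)
    return branch_target(delay_slot_addr, offset_bytes) ^ flip


def error_length(good, bad):
    """Number of differing bits (assumed meaning of error length)."""
    return bin(good ^ bad).count("1")


def random_injection(delay_slot_addr, offset_bytes, rng=random):
    """Pick the corrupted value and bit at random, as in the text above."""
    where = rng.choice(["base", "offset", "result"])
    bit = rng.randrange(32)
    good = branch_target(delay_slot_addr, offset_bytes)
    bad = inject(delay_slot_addr, offset_bytes, where, bit)
    return where, bit, error_length(good, bad)
```

Flipping a bit of the result always yields an error of length 1. Flipping an operand bit can propagate through the carry chain: with a delay slot at 0xFFC and an offset of 4, clearing bit 2 of the offset turns the target 0x1000 into 0xFFC, which differs in 11 bit positions.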

A synthetic workload is created by filling the memory with blocks of random length, following a uniform distribution between 3 and 17 words. While the shortest block in the original MIPS architecture is 2 instructions long (the branch and the instruction in the branch delay slot), in HORUS this block is augmented with the block signature (a signature has the same length as an instruction: it is a 32-bit word). The ISIS-modified gcc compiler limits the block length to fit the length field of the signature word, so no block larger than 17 words (16 instructions plus the signature) is allowed in our system. These length values match the mean length of sequential instructions, claimed to be between 7 and 8 [11].

Figure 3: XOR tree to obtain checksum bits for a 3-bit address guard

Once the memory is filled, for each block the type of instruction at its end and the target block are chosen randomly. With this information, the address guard bits are calculated using the same algorithm internally used by the compiler and stored into the block structure for future reference.

This algorithm starts by calculating the address difference between the branch and the target instructions. This 32-bit value is then compressed using a simple xor tree to obtain the address guard bits. Although the original proposal of ISIS reserves 3 bits for the guard, the xor tree is easily expandable to accommodate larger fields if space is available.

Figure 3 shows a representation of the xor tree for a guard field of 3 bits (g2g1g0). Xoring alternating bits helps the watchdog processor to detect multiple bit errors, where a single bit error in an operand propagates into a sequence of bit errors in the calculated result. Note that the 32-bit value calculated above (V31..0) is padded with zeroes where necessary.
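A possible reading of this compression step is sketched below in Python. The exact wiring of the xor tree is defined by Figure 3, which is not reproduced here, so the assignment of bit V_j to guard bit g_(j mod k) is an assumption consistent with "xoring alternating bits"; the function name is hypothetical.

```python
def address_guard(branch_addr, target_addr, guard_bits=3):
    """Compress the 32-bit address difference into guard_bits checksum bits.

    Assumed wiring: bit j of the difference is xored into guard bit (j mod guard_bits).
    Zero padding of the 32-bit value has no effect, since xoring with 0 is neutral.
    """
    diff = (target_addr - branch_addr) & 0xFFFFFFFF
    guard = 0
    for bit in range(32):
        if (diff >> bit) & 1:
            guard ^= 1 << (bit % guard_bits)
    return guard
```

Under this wiring, any single bit error in the difference changes exactly one guard bit, and a burst of up to guard_bits adjacent erroneous bits changes as many distinct guard bits, so such bursts are always detected. Errors in two bits that feed the same guard bit (for example bits 0 and 3 with a 3-bit guard) cancel out and escape detection.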

5.2. Results

Several fault injection campaigns have been carried out. Each campaign consists of the injection of 50,000 errors, and the experiments have been repeated a number of times with different random seeds to obtain their standard deviation, a measure of statistical dispersion.

Table 1: Block Start error coverage

Memory size    Mean (%)    Std. deviation
64 Kbytes      45.14       0.287
256 Kbytes     50.44       0.339
1 Mbytes       56.53       0.133
2 Mbytes       59.21       0.212

To analyze the impact of the address guard field size, guards from 2 to 6 bits have been used in each experiment. The memory used by the application program has been changed from 64 Kbytes to 2 Mbytes. A larger memory size theoretically increases the possibility of an erroneous branch targeting the start of a block, with the error going undetected by the Block Start check.

Table 2: Block Address error coverage. Each cell shows the mean (%) with the standard deviation in parentheses.

Memory size    2 bits           3 bits           4 bits           5 bits           6 bits
64 Kbytes      96.52 (0.199)    98.31 (0.088)    98.70 (0.094)    99.29 (0.023)    99.37 (0.038)
256 Kbytes     96.74 (0.055)    98.42 (0.042)    99.01 (0.042)    99.31 (0.018)    99.37 (0.028)
1 Mbytes       96.81 (0.114)    98.48 (0.025)    99.15 (0.056)    99.36 (0.016)    99.40 (0.028)
2 Mbytes       96.79 (0.060)    98.49 (0.054)    99.15 (0.020)    99.36 (0.040)    99.40 (0.034)

Other elements of the HORUS processor that include checking mechanisms to detect CFEs, but are not explicitly part of the watchdog processor, have been left out of our experiments, as they do not characterize the error coverage we are trying to obtain from the inclusion of the watchdog. For example, the Memory Management Unit would trigger an exception if a branch targets an unused memory area. Another check used by the main processor covering the same type of errors is the alignment check: all instructions fetched from memory must be aligned on a word boundary, or an exception is triggered. This means the results shown do not correspond to the system error detection coverage, but only to the coverage of the aforementioned EDMs.

Table 1 summarizes the error coverage obtained with the Block Start mechanism alone, for each memory size.

As the results show, the memory size has the opposite effect to what is theoretically expected. A larger memory increases the error coverage, although moderately, despite the fact that there are more possibilities of erroneously targeting a block start. This can be explained by the fact that, at the same time, a larger memory means there are more possibilities that the erroneous address falls inside the covered memory area.

Table 2 shows the error coverage obtained with the Block Address mechanism for different address guard sizes and memory sizes, and the combined error coverage is shown in Table 3.

Another interesting result from the experiments carried out is the error length distribution, shown in Table 4. This table shows how a single bit error may propagate into a multiple bit error as the address calculation process takes place. Although the data shown correspond to only one of the experiments, the other experiments offer similar results and their data values have been omitted to avoid redundancy. Error lengths above 10 bits have also been omitted because of their negligible impact.

Table 3: Block Start and Block Address combined error coverage. Each cell shows the mean (%) with the standard deviation in parentheses.

Memory size    2 bits           3 bits           4 bits           5 bits           6 bits
64 Kbytes      97.41 (0.139)    98.70 (0.084)    98.73 (0.092)    99.36 (0.023)    99.37 (0.038)
256 Kbytes     97.97 (0.032)    98.68 (0.055)    99.31 (0.018)    99.34 (0.016)    99.37 (0.027)
1 Mbytes       97.99 (0.070)    98.75 (0.045)    99.34 (0.044)    99.38 (0.013)    99.40 (0.027)
2 Mbytes       98.55 (0.038)    99.27 (0.024)    99.35 (0.026)    99.38 (0.045)    99.94 (0.015)

Table 4: Error length distribution

Error length    Mean (%)    Std. deviation
0 (masked)      16.45       0.149
1               71.45       0.079
2                5.49       0.075
3                2.73       0.089
4                1.45       0.047
5                0.85       0.022
6                0.50       0.016
7                0.39       0.019
8                0.31       0.022
9                0.30       0.020
10               0.01       0.006

As expected, the error length concentrates around single bit errors, but the percentages of masked errors and of multiple bit errors ranging from 2 to 4 bits are also noticeable.

6. Conclusions

The checking mechanisms of the ISIS technique to detect CFEs have been discussed, and their implementation on the HORUS processor has been outlined. This practical implementation has been complemented by a modified version of the ubiquitous C-language compiler gcc, which automatically inserts signatures into the application program, relieving the programmer of most system reliability details.

Although the small number of bits reserved to check branch addresses could raise some doubts about the effectiveness of the error detection mechanisms, the injection of faults into a model of the memory subsystem has shown these doubts to be unfounded.

The model represents the contents of each block as a sequence of instructions preceded by the block's signature, and the address and length of each block are computed and stored for future reference. Single-bit errors have been injected into the model, and the Block Start and Block Address EDMs have shown their effectiveness in detecting CFEs.

Error coverage can be improved using an address guard field larger than the original 3-bit proposal. This requires reducing other checking fields, the opcode signature being the most promising alternative. Reducing this field could also reduce the error coverage of the associated mechanism (not described in this work), so the reduction requires further analysis.

Another interesting result presented in this paper is the error length distribution in the address calculation process. Although single-bit errors are injected into the model, the arithmetic circuitry used in the address calculation process when a branch is taken causes the error to propagate as a multiple-bit error in the computed value. The error length distribution can be applied to other architectures using absolute or program counter relative addressing modes, and may help future researchers to take this propagation into account when designing error detection mechanisms.

Acknowledgements

This work is supported by the Ministerio de Educación y Ciencia of the Spanish Government under project TIC2003-08106-C02-01.

Bibliography

[1] Avresky, D., Grosspietsch, K. E., Johnson, B. W., Lombardi, F.: Embedded fault tolerant systems. IEEE Micro Magazine, (1998) 18(5):8–11

[2] IEEE Std. 1076-1993: VHDL Language Reference Manual. The Institute of Electrical and Electronics Engineers Inc., New York (1995)

[3] Gunneflo, U., Karlsson, J., Torin, J.: Evaluation of Error Detection Schemes Using Fault Injection by Heavy-ion Radiation. In Proceedings of the 19th Fault Tolerant Computing Symposium (FTCS-19), Chicago, Illinois (1989) 340–347

[4] Czeck, E.W., Siewiorek, D.P.: Effects of Transient Gate-Level Faults on Program Behavior. In Proceedings of the 20th Fault Tolerant Computing Symposium (FTCS-20), Newcastle upon Tyne, U.K. (1990) 236–243

[5] Ohlsson, J., Rimén, M., Gunneflo, U.: A Study of the Effects of Transient Fault Injection into a 32-bit RISC with Built-in Watchdog. In Proceedings of the 22nd Fault Tolerant Computing Symposium (FTCS-22), Boston, USA (1992) 316–325

[6] Siewiorek, D.P.: Niche Successes to Ubiquitous Invisibility: Fault-Tolerant Computing Past, Present, and Future. In Proceedings of the 25th Fault Tolerant Computing Symposium (FTCS-25), Pasadena, USA (1995) 26–33

[7] Rodríguez, F., Campelo, J.C., Serrano, J.J.: A Watchdog Processor Architecture with Minimal Performance Overhead. Lecture Notes in Computer Science (LNCS Series), Springer-Verlag ed. (2002) vol. 2434, 261–272

[8] Rodríguez, F., Campelo, J.C., Serrano, J.J.: The HORUS Processor. In Proceedings of the XVII Conference on Design of Circuits and Integrated Systems (DCIS 2002), Santander, Spain (2002) 517–522

[9] MIPS32 Architecture for Programmers, volume I: Introduction to the MIPS32 Architecture. MIPS Technologies (2001)

[10] AMBA Specification rev2.0. ARM Limited (1999)

[11] Hennessy, J.L., Patterson, D.A.: Computer Architecture. A Quantitative Approach (2nd edition). Morgan Kaufmann Publishers (1996)


Reducing the VHDL-based fault injection simulation time in a distributed environment


Grupo de Sistemas Tolerantes a Fallos (Fault Tolerant Systems Group)
Dept. Informática de Sistemas y Computadores (DISCA)
Universidad Politécnica de Valencia, 46022-Valencia (Spain)
email: {prodrig, jcampelo, jserrano}@disca.upv.es

Abstract

In this paper we present a distributed simulation toolkit specially developed to help researchers in dependability assessment studies that involve fault injection into complex VHDL models. Two mechanisms, restarting the simulator and restoring the model state, are evaluated. The tool architecture and results from experiments carried out on a complex System-on-Chip are presented to demonstrate its usefulness. These results clearly show that selecting the proper mechanism yields a noticeable reduction of the simulation time when using our tool compared with a general-purpose workload distribution application.

1. Introduction

With the advent of modern technologies in the field of programmable devices and enormous advances in the software tools used to model, simulate and translate into real hardware almost any digital system, the capability to design a whole System-on-Chip (SoC) has become a reality even for small companies. With the widespread use of embedded systems in our everyday life, service availability and dependability concerns for these systems are increasingly important [1].

A SoC is usually modelled using a Hardware Description Language (HDL) like VHDL. It allows a hierarchical description of the system, and the designed elements interconnect in much the same way as they would in a graphical design flow, but using an arbitrary abstraction level. It also provides I/O facilities to easily incorporate test vectors, and language assertions to verify the correct behaviour of the model during the simulation.

Every Error Detection Mechanism (EDM) incorporated into the SoC to increase the system dependability must be characterised. This characterisation includes the probability of detecting errors (coverage) and the time from fault activation to error detection (latency). Fault injection (FI) is a consolidated technique [2, 3] to assess a mechanism's error detection properties, and it is also used to determine how errors propagate through the system, revealing which are the critical elements in the designed system.

Fault injection means the deliberate insertion of faults into a system in order to analyse its behaviour in the presence of errors, and is defined in [3] as: "The dependability validation technique that is implemented by means of controlled tests where the observation of the behaviour of the system in presence of faults is explicitly induced by the deliberate introduction (injection) of faults in the system". Fault injection differs from other experimental techniques in that it involves the whole system, both its physical component (hardware) and its behavioural component (software).

Fault injection may be performed during the design phase using a simulation model of the system (simulation-based fault injection), or during the prototype phase by injecting faults into a system prototype or into the final system. When simulation-based fault injection is used, faults must be added to the system model. To be useful, these faults should simulate the effect of real faults on the real system.

In error propagation studies, the trace of the injected model is compared against a fault-free simulation trace, called the golden run, in order to show whether the fault has been activated, generating an error, and to reveal the error propagation path. In EDM characterisation studies, the trace from the injected simulation must also include enough information from the EDM itself to determine whether the error has been properly detected.

Several simulation-based fault injection techniques have been proposed in the literature [4, 5, 6, 7, 8, 9] that use an HDL to describe the system and the faults to be injected. One of them relies on simulator commands to force the value of some signals in the VHDL model, thus generating a fault. As the fault is injected into the model at simulation time, the original model needs no modification or recompilation, making this technique a popular solution. This is the approach used by the tool presented here.

In order to achieve an adequate confidence level in the dependability results, the statistical analysis demands a large set of simulations (several thousand) to be carried out, even after pruning techniques [10, 11] are applied. In this simulation set, called an experiment campaign, we must decide what kinds of faults must be considered, where to inject them, and when during the simulation run. Every run is the simulation of the model in the presence of a single fault.

This paper is structured as follows. In the next section, the motivation for this work and its objective are presented, followed by the description of the developed tool. Then, the SoC used for our experiment campaigns is briefly described, and the measurements taken for different environments are presented. After this, the results of these measurements are analysed, finishing with the conclusions of the work.

2. Motivation

The massive simulation workload of a fault injection campaign fits naturally in a distributed environment. Simulation runs are independent of each other, so they can be managed as different simulator executions. This is the approach used by general-purpose tools to achieve automatic resource sharing and load balancing on this kind of complex, heterogeneous environment [12, 13]. They help the SoC designer to carry out the simulations in a distributed environment and collect the result files, but they do not cover the campaign data generation or the analysis of the simulation results. For these tasks, a specialised tool is needed [4, 5, 6, 8, 14], one that can be easily coupled with a distributed simulation environment (if a general-purpose tool for the distributed environment is used).

The general-purpose tools use a workload model that translates every user task (a simulation run in our case) into a set of batch program executions. This makes no use of the capabilities of currently available simulators [15], losing a speed-up opportunity. If the speed-up lost is sufficiently large, it can justify the use of a specialised simulation management tool.

A powerful simulator provides a restart command that resets the simulated model time to zero, allowing several simulations to be carried out without the overhead of terminating the simulator program and executing it again. It may also provide a restore command, which returns the model to a previously saved state so that simulation continues from the restored time on. This restoring mechanism may be used to trim fault injection simulations, as the model behaves exactly like the fault-free one until the time the fault is injected.

Our research group has already developed a fault injection tool, called VFIT [14]. It is a powerful and mature fault injection tool for VHDL models. It includes a rich set of features in the field of fault injection, but lacks the distributed simulation capabilities mentioned above, so we must resort to a general-purpose tool for the simulation workload.

The software toolkit presented here addresses all these issues, incorporating a simulation framework that makes full use of VHDL simulator commands to speed up the simulation process. We have called this set of software elements the FIASCO toolkit (Fault Injection Aid Software COmponents). Among the features of this toolkit are:

• Automated processes to generate the experiment campaign data. The user selects the signals to be traced in the simulation from a hierarchical view of the SoC model. A graphic interface is also used to specify the fault injection parameters (number and type of faults to be injected, the distribution functions of the injection start time and length) and simulation options (simulation time, use of restart or restore commands, etc.).

• Use of a heterogeneous set of simulation hosts to distribute the simulation runs. FIASCO makes full use of the simulator commands to speed up the simulations. To avoid system overheads the simulator program is executed once per host. As the model does not change, a very fast restart command may be issued for every simulation run.

• Automatic collection and analysis of the result files, obtaining the statistical data the user has previously requested. A specific language has been developed to let the user express relationships between the golden run and the injected simulation in order to generate the experiment's dependability formulae.

The goal of the FIASCO toolkit and the experiments presented here is to determine whether noticeable performance improvements are possible with the use of a specialised simulation framework, before including this feature in the next release of VFIT. As the work presented here is based on an interpreted language, the compiled nature of VFIT is expected to further increase the performance obtained with FIASCO.

3. The FIASCO toolkit

Modelsim is a very popular digital simulator from a major EDA company that runs on a large variety of system architectures [15]. The graphic interface is built around an embedded Tcl/Tk interpreter [16], giving the user great extensibility through Tcl scripts that can be dynamically loaded and executed. These Tcl scripts are text files that can add new commands or modify existing ones. The user can even create a custom graphic interface using the Tk toolkit. These capabilities are exploited by FIASCO to graphically assist the user in dependability research.

This commercial simulator provides the commands mentioned in the previous section to speed up a set of simulations from a single program execution. These commands can be entered from a text console or from a batch script, and are processed by Modelsim's internal Tcl interpreter. The simulator supports text-only batch simulations executing commands from a text script, and this is the approach used in our simulation hosts.

The FIASCO toolkit is built around two Tcl scripts that integrate into Modelsim once loaded and communicate with each other over TCP/IP sockets using a client/server protocol (see Fig. 1). The client (called FIASCO-C) generates the campaign data, performs the statistical analysis and controls the simulation in the distributed environment. The server (FIASCO-S) executes within the Modelsim simulator in each simulation host. It receives simulation requests from the client and uses the simulator commands to carry them out. As FIASCO-S is an integral part of the simulator, it can be used on any system architecture supported by the simulator itself. The FIASCO-C script incorporates the experience gained and the technologies developed during the development of our group's tool VFIT mentioned before.

Figure 1: FIASCO toolkit component interconnection.

Sockets are used as the communication channel between the client and the simulation servers. To tolerate network failures, the connection between client and server uses an asymmetric protocol that is stateless on the server side. A simulation campaign is carried out as follows. First, the client assigns a set of simulations to every idle server (step a). The server simulates and locally stores the result files (step b), signalling the client when the simulations have finished (step c). The client reacts by assigning a new set of simulations to the idle server, and by collecting the result files from the simulation host using the ftp protocol (step d). Once all the result files are available, the client analyses them (step e) to obtain the dependability statistics.

If the network fails for some reason, the connection socket between server and client closes, but the server continues simulating. If it cannot signal the end of the requested simulations to the client, it simply waits for the client to reconnect. When the client connects to a simulation server it first requests the server status to determine whether the server is idle, so the protocol on the client side can resynchronise accordingly.
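The interaction described above can be sketched with an in-memory Python model. It uses no real sockets or ftp transfers, and the class, method and status names are hypothetical; they are not the FIASCO-C/FIASCO-S commands. The point illustrated is that the server keeps no information about the client, so the client can resynchronise after a network failure simply by asking for the server status.

```python
class SimServer:
    """Stand-in for a simulation host; it tracks only its own work."""

    def __init__(self):
        self.pending = []   # simulation ids still to be run
        self.finished = []  # ids whose result files are stored locally

    def assign(self, sim_ids):  # step a
        self.pending = list(sim_ids)

    def run_all(self):  # step b: proceeds even if the client is disconnected
        self.finished += self.pending
        self.pending = []

    def status(self):
        if self.pending:
            return "busy"
        return "done" if self.finished else "idle"

    def fetch_results(self):  # step d
        results, self.finished = self.finished, []
        return results


def client_poll(server, queue, collected, batch=2):
    """One client visit to a server, on first contact or after a reconnect."""
    state = server.status()
    if state == "busy":
        return state  # nothing to do until the server finishes
    if queue:
        count = min(batch, len(queue))
        server.assign([queue.pop(0) for _ in range(count)])
    if state == "done":
        collected.extend(server.fetch_results())
    return state
```

A client that lost its connection while a server was simulating calls client_poll again after reconnecting: a "busy" answer means it should wait, while "done" means results are ready to be collected and a new batch can be assigned.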

To ensure interoperability between client and servers, result files are plain text files. However, text-based trace files tend to be very large due to the file format the simulator uses (more than 40 Mbytes in our largest experiments; see the next section for a description). To avoid running out of disk space and wasting time in the ftp transfers, we have developed a new proprietary format. This format is still human-readable text, but it achieves a reduction ratio between 19 and 26 (depending on the trace itself). The final file is then compressed using a standard utility before the simulation is considered finished.

Figure 2: SoC architecture under test.

3.1. Simulation speed-up techniques

The FIASCO-S component is executed only once per host, and the simulator program is kept alive from simulation to simulation. This is radically different from a general-purpose approach, which would execute the simulator for every injection, with the corresponding OS overhead to load and unload the simulator program.

The user can decide that only the restart command is used. In this case, all simulations are carried out from time zero. The OS overhead to load and unload the program is replaced by the time the simulator needs to initialise the model state.

If the user decides to use the restore command, a number of state files (checkpoints) are generated before the first injection. When a simulation is started, the closest state is selected, trimming the simulation time. However, it must be taken into account that a restore is much heavier than a simple restart, as the model state must be retrieved from a disk file. The initial overhead of creating the checkpoints must also be accounted for.
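The checkpoint selection can be sketched in Python as follows. This is an illustration under two assumptions not stated in the text: checkpoints are evenly spaced starting at time zero, and the "closest" state is the latest checkpoint taken at or before the injection time (a later one would already be past the fault). All names are hypothetical.

```python
import bisect


def make_checkpoints(sim_length, count):
    """Evenly spaced checkpoint times starting at zero (assumed placement)."""
    step = sim_length // count
    return [i * step for i in range(count)]


def closest_checkpoint(checkpoints, injection_time):
    """Latest checkpoint at or before injection_time; the list must be sorted."""
    idx = bisect.bisect_right(checkpoints, injection_time) - 1
    return checkpoints[max(idx, 0)]


def cycles_to_simulate(sim_length, checkpoints, injection_time):
    """Clock cycles left to simulate after restoring the selected checkpoint."""
    return sim_length - closest_checkpoint(checkpoints, injection_time)
```

With 10 checkpoints over a 3000-cycle run, a fault injected at cycle 1234 restores the state saved at cycle 1200, so only 1800 cycles are simulated instead of 3000. The saving has to be weighed against the cost of the restore itself and of generating the checkpoints.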

4. Test description

To test the FIASCO toolkit we have used a complex SoC (see Fig. 2). It comprises a MIPS R3000 processor [17], an instruction cache and a set of AMBA bus elements [18] to connect the SoC with the external memories. As our EDM we use a watchdog processor similar to the ones described in [19, 20], based on the Embedded Signature Monitoring technique [20].

Table 1: No checkpoints experiments - total and mean simulation times (in seconds), by number of injections (n) and simulated clock cycles.

Total simulation time
n      100        500        1500        3000
1      22.61      77.11      235.55      475.68
10     179.44     733.64     2308.26     4699.60
50     875.57     3649.85    11597.22    23631.73
100    1742.93    7272.51    23204.49    47729.94

Mean simulation time
n      100        500        1500        3000
1      22.61      77.11      235.55      475.68
10     17.944     73.364     230.826     469.96
50     17.5114    72.997     231.9444    472.6346
100    17.4293    72.7251    232.0449    477.2994

Table 2: 10 checkpoints experiments - total and mean simulation times (in seconds), by number of injections (n) and simulated clock cycles.

Total simulation time
n      100        500        1500        3000
10     212.68     683.97     2030.9      4073.23
50     850.51     3197.35    10098.24    20304.64
100    1647.03    6295.28    19743.52    40510.87

Mean simulation time
n      100        500        1500        3000
10     21.268     68.397     203.09      407.323
50     17.0102    63.947     201.9648    406.0928
100    16.4703    62.9528    197.4352    405.1087

All the elements mentioned above have been developed in our group as synthesizable RTL models. We have also added a trace facility inside the processor's model, and a VHDL testbench to incorporate the external memories into the model. The program space is filled at start-up time with the Eratosthenes sieve prime number generator program inside an infinite loop. This allows us to arbitrarily change the number of simulated CPU clock cycles.

We have carried out a simple experiment in order to estimate our tool's performance improvement, studying the simulation time for a single simulation server with different testbench configurations. Both server and client execute on the same machine, a PC (a 1.1 GHz Athlon processor with 512 Mbytes of DDR-SDRAM) running Linux Mandrake. This arrangement eliminates the transfer of the result files from server to client, restricting the experiment times to simulation.

We vary the number of injections to be carried out from 10 to 100 to evaluate the number of injections a single server should perform in a medium to large machine pool. To study the simulation time for different complexity levels, we simply change the number of clock cycles to be simulated from 100 to 3000. These simulation cycles translate into simulation times from 22 seconds to 8 minutes for a single injection.

If we call T the mean time to execute the simulator program for a single injection and n the number of injections, a lower bound for the simulation time using a general-purpose tool is simply n × T. To evaluate the usefulness of the restart and restore simulator commands, three different types of simulations are performed: using the restart command only (no checkpointing), and using 10 and 50 checkpoints to restore.

Table 3: 50 checkpoints experiments - total and mean simulation times (in seconds), by number of injections (n) and simulated clock cycles.

Total simulation time
n      100        500        1500        3000
10     425.13     895.02     2236.12     4253.16
50     1065.82    3380.29    10138.02    20325.9
100    1836.89    6421.62    20026.1     40209.05

Mean simulation time
n      100        500        1500        3000
10     42.513     89.502     223.612     425.316
50     21.3164    67.6058    202.7604    406.518
100    18.3689    64.2162    200.261     402.0905

Table 4: Simulation times (normalised), by number of injections (n) and simulated clock cycles.

       No checkpoints                10 checkpoints                50 checkpoints
n      100    500    1500   3000     100    500    1500   3000     100    500    1500   3000
10     0.79   0.95   0.98   0.99     0.94   0.89   0.86   0.86     1.88   1.16   0.95   0.89
50     0.77   0.95   0.98   0.99     0.75   0.83   0.86   0.85     0.94   0.88   0.86   0.85
100    0.77   0.94   0.99   1.00     0.73   0.82   0.84   0.85     0.81   0.85   0.85   0.85

5. Experimental results

The measured times for the simulations using no checkpoints are shown in Table 1. The total time is the actual measurement and the mean simulation time is derived from this value and the number of injections. Tables 2 and 3 show the same measurements for the experiments using checkpoints.

We normalise the mean simulation times using T as the simulation unit to compare the performance of FIASCO with that of a general-purpose tool. The value of T is the time for a single simulation (row n = 1 in Table 1 above). Normalised values are shown in Table 4.
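As a worked example of this normalisation, the short Python sketch below recomputes two entries of Table 4 from the measured totals for 100 simulated clock cycles. The function names are illustrative only.

```python
def normalised_mean(total_time, n, t_single):
    """Mean time per injection divided by T, the single-injection time."""
    return (total_time / n) / t_single


def improvement_percent(total_time, n, t_single):
    """Time saved per injection relative to the n x T lower bound, in percent."""
    return (1.0 - normalised_mean(total_time, n, t_single)) * 100.0


T_100_CYCLES = 22.61  # row n = 1 of Table 1, 100 simulated clock cycles

# No checkpoints, n = 10: total of 179.44 s
print(round(normalised_mean(179.44, 10, T_100_CYCLES), 2))       # 0.79
# 10 checkpoints, n = 100: total of 1647.03 s
print(round(normalised_mean(1647.03, 100, T_100_CYCLES), 2))     # 0.73
print(round(improvement_percent(1647.03, 100, T_100_CYCLES)))    # 27
```

Values below 1 mean FIASCO needs less time per injection than a tool that launches the simulator once per injection, and values above 1 (such as 1.88 for 50 checkpoints and 10 short injections) mean the checkpoint overhead outweighs the saving.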

From Table 4 is evident the bene�t fromusing a specialised simulation tool likeFIASCO. The performance improvementgrows up to 23 % using no checkpointsfor short simulations and up to 15 % us-ing checkpoints for long simulations. In-terestingly, increasing the number of check-points does not produce a noticeable per-formance increment. Although simulatedtime is shorter as the number of checkpointsincrease, because checkpoints are closer tothe time the fault must be injected, thisdoes not translate in an overall improve-ment. This may be due to the increase inthe initial overhead to generate more check-points.

The checkpoint generation overhead can even be counterproductive when injections take a short time to simulate. If injections can be simulated quickly, the simulation time saved is negligible and does not compensate for the large time required to generate the checkpoints. These results do not agree with the ones presented in [11]. The authors argue that checkpointing is always beneficial, and that it is only necessary to find the optimum number of checkpoints. Those results are, however, obtained from a much simpler model (an 8-bit microcontroller with no cache memories or MMU), so the checkpoints need much less time to be generated and restored and the generation overhead is easily compensated. The difference in complexity between the models used may explain this discrepancy.
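The trade-off described above can be illustrated with a small cost model. This is only a sketch under stated assumptions, not the procedure used to obtain the tables: it assumes evenly spaced checkpoints, fault instants uniformly distributed over the run, and hypothetical per-checkpoint generation and restore costs (`t_gen`, `t_restore`). All numbers are illustrative.

```python
def campaign_time(n, t_full, k=0, t_gen=0.0, t_restore=0.0):
    """Estimated wall-clock time (seconds) for n injections on one server.

    Toy model, not the paper's measurement procedure:
      * k checkpoints are taken at evenly spaced instants 0, 1/k, 2/k, ...
        of the run, each costing t_gen seconds, paid once per campaign;
      * fault instants are uniform over the run, so restoring the nearest
        earlier checkpoint skips (k - 1) / (2k) of the run on average;
      * every injection also pays t_restore seconds to load a checkpoint.
    """
    if k == 0:
        return n * t_full                      # restart only, no checkpoints
    simulated_fraction = (k + 1) / (2 * k)     # part of the run still simulated
    return k * t_gen + n * (t_restore + t_full * simulated_fraction)


# Illustrative numbers only (not taken from the tables above).
short_run = dict(t_full=20.0, k=50, t_gen=5.0, t_restore=1.0)
for n in (10, 100):
    plain = campaign_time(n, t_full=20.0)
    with_ckpt = campaign_time(n, **short_run)
    verdict = "checkpoints help" if with_ckpt < plain else "checkpoints hurt"
    print(n, plain, round(with_ckpt, 1), verdict)
```

With these assumed costs the fixed generation overhead dominates when only a few short injections are run, and is amortised only when a server performs many of them, which matches the qualitative behaviour discussed in the text.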

6. Conclusions

In this paper, a specialised distributed simulation framework for dependability assessment using fault injection into VHDL models has been presented. The performance improvement obtained with our FIASCO toolkit in a simple experiment has also been presented, and the results are very interesting.

FIASCO is especially well suited for fault injection experiments needing a large number of short simulations, as is the case for the dependability assessment of low-latency error detection mechanisms. The performance benefit ranges from 23 % to 27 % per simulation, depending on the total number of simulations a single host must perform and the speed-up technique used. These figures prove the usefulness of such a tool in the dependability assessment field.

Not all simulation cases benefit so much, however. The performance improvement drops to 15 % for long simulations when a single host must carry out a large number of simulations, and to a moderate 10 % for a large enterprise pool where every host has only a few simulations to solve.

Acknowledgements

This work is partially supported by the Spanish Government's Comisión Interministerial de Ciencia y Tecnología under the project reference CICYT TAP99-0443-C05-02.

Bibliography


[2] Laprie, J. C.: Dependable computing: concepts, limits and challenges. Proc. of the 25th Fault Tolerant Computing Symposium (FTCS-25), pp. 42-54, Pasadena, California, 1995.

[3] Arlat, J., Aguera, M., Amat, L., Crouzet, Y., Fabre, J. C., Laprie, J. C., Martins, E., Powell, D.: Fault injection for dependability validation: A methodology and some applications. IEEE Transactions on Software Engineering, pp. 166-182, vol. 16, 1990.

[4] Boue, J., Petillon, P., Crouzet, Y.: MEFISTO-L: a VHDL based fault injection tool for the experimental assessment of fault tolerance. Proc. of the 28th Fault Tolerant Computing Symposium (FTCS-28), pp. 168-73, 1998.

[5] Jenn, E., Arlat, J., Rimen, M., Ohlsson, J., Karlsson, J.: Fault Injection into VHDL Models: The MEFISTO Tool. Proc. of the 24th Fault Tolerant Computing Symposium (FTCS-24), pp. 66-75, 1994.

[6] Gil, D., Baraza, J. C., Busquets, J. V., Gil, P. J.: Fault Injection with simulation in VHDL and application to a simple microcomputer system. Proc. of the 5th International Conference on Advanced Computing, pp. 466-474, 1997.

[7] Sieh, V., Tschäche, O., Balbach, F.: VHDL-based Fault Injection with VERIFY. TR-5/96, University of Friedrich-Alexander, Computer Architecture Department, Erlangen-Nuremberg, 1996.

[8] Sieh, V., Tschäche, O., Balbach, F.: VERIFY: Evaluation of Reliability Using VHDL models with Embedded Fault Descriptions. Proc. of the 27th Fault Tolerant Computing Symposium (FTCS-27), pp. 32-36, 1997.

[9] Gil, D., Baraza, J. C., Busquets, J. V., Gil, P. J.: Fault Injection into VHDL models: Analysis of the Error Syndrome of a Microcomputer System. Proc. of the 24th Euromicro Conference, pp. 418-424, 1998.

[10] Berrojo, L., González, I., Corno, F., Sonza, M., Entrena, L., Lopez, C.: New Techniques for Speeding-up Fault-Injection Campaigns. Proc. of the Design, Automation & Test in Europe (DATE 2002), pp. 847-852, Paris 2002.

[11] Parrotta, B., Rebaudengo, M., Sonza, M., Violante, M.: Speeding-up Fault-Injection Campaigns in VHDL models. Proc. of the 19th Intl. Conference on Computer Safety, Reliability & Security (SAFECOMP 2000), pp. 27-36, Rotterdam 2000.

[12] Basney, J., Livny, M.: Deploying a High Throughput Computing Cluster. High Performance Cluster Computing, vol. 1, R. Buyya (Editor), Prentice Hall 1999.

[13] Zhou, S., Wang, J., Zheng, X., Delisle, P.: Utopia: A load sharing facility for large, heterogeneous distributed computing systems. University of Toronto, Computer Systems Research Institute, Toronto 1992.

[14] Baraza, J. C., Gracia, J., Gil, D., Gil, P. J.: A Prototype of a VHDL-Based Fault Injection Tool. Description and Application. Journal of Systems Architecture, special issue on Defect and Fault Tolerance in VLSI Systems, to appear.

[15] Modelsim SE v5.5b Command Reference. Model Technology Inc., 2001.

[16] Welch, B.: Practical Programming in Tcl/TK, 3rd edition. Prentice Hall, 2001.

[17] MIPS32 Architecture for Programmers, volume I: Introduction to the MIPS32 Architecture. MIPS Technologies, 2001.



[20] Ohlsson, J., Rimén, M., Gunneflo, U.: A Study of the Effects of Transient Fault Injection into a 32-bit RISC with Built-in Watchdog. Proc. of the 22nd Fault Tolerant Computing Symposium (FTCS-22), pp. 316-325, Boston, Massachusetts, 1992.

A Distributed Simulation Environment for Fault Injection Analysis on SoC Models

Grupo de Sistemas Tolerantes a Fallos (Fault Tolerant Systems Group)
Departamento de Informática de Sistemas y Computadores (DISCA)
Universidad Politécnica de Valencia, 46022-Valencia (Spain)
email: {prodrig, jcampelo, jserrano}@disca.upv.es

Abstract

In this paper, we present a distributed simulation environment specially developed to help researchers in dependability assessment studies that involve fault injection into complex VHDL models. The tool architecture and some results from experiments carried out on a complex System-on-Chip are presented. These results show a noticeable reduction of the simulation time when using our tool compared with a general-purpose workload distribution application.

1. Introduction and Motivation

With the advent of modern technologies in the field of programmable devices and enormous advances in the software tools used to model, simulate and translate into real hardware almost any complex digital system, the capability to design a whole System-On-Chip (SoC) has become a reality even for small companies. With the widespread use of embedded systems in our everyday life, service availability and dependability concerns for these systems are increasingly important [1].

Fault injection (FI) is a technique that is being consolidated in the Fault Forecasting area. It is defined in [2] as: "The dependability validation technique that is implemented by means of controlled tests where the observation of the behaviour of the system in presence of faults is explicitly induced by the deliberate introduction (injection) of faults in the system". When simulation-based fault injection is used, faults must be added to the system model. The results of the injected model are compared against a fault-free simulation run called the golden run.

Several simulation-based fault injection techniques that use an HDL to describe the system under test have been proposed in the literature. One of these techniques uses simulator commands to force the value of the signals that connect the elements of the design, thus generating a fault. As the fault is injected into the model at simulation time, the original model needs no modification or recompilation, which makes this technique a popular solution. This technique is known as forcing and it is used by the tool presented here.
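As an illustration of forcing, the following sketch builds the sequence of simulator commands for one transient fault. It is not taken from the tool itself: the function name, the signal path and the timing values are hypothetical, and the command spellings (`restart -f`, `run`, `force -freeze`, `noforce`) are assumed to follow the simulator's command reference and should be checked against it.

```python
def forcing_commands(signal, value, inject_at_ns, duration_ns, total_ns):
    """Build simulator commands that inject one transient fault by forcing.

    The model itself is untouched: the fault exists only as commands
    issued to the simulator while the simulation is running.
    """
    if inject_at_ns < 0 or duration_ns <= 0:
        raise ValueError("injection time must be >= 0 and duration > 0")
    if inject_at_ns + duration_ns > total_ns:
        raise ValueError("fault must end before the simulation does")
    remaining_ns = total_ns - inject_at_ns - duration_ns
    return [
        "restart -f",                         # reload the model, no recompilation
        f"run {inject_at_ns} ns",             # fault-free prefix
        f"force -freeze {signal} {value}",    # fault becomes active
        f"run {duration_ns} ns",              # fault duration
        f"noforce {signal}",                  # fault removed (transient)
        f"run {remaining_ns} ns",             # observe the effects
    ]


# Hypothetical signal path and timing, for illustration only.
for command in forcing_commands("/tb/soc/cpu/pc(3)", "1", 400, 20, 1000):
    print(command)
```

The trace produced by such a run would then be compared against the golden run to classify the effect of the fault.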

To gather the statistical information needed to assess the dependability of the model, a large set of simulations (usually several thousand) must be carried out. Every simulation run of this campaign is the simulation of the model in the presence of a single fault.

The massive simulation workload of a fault injection campaign fits naturally in a distributed environment. Simulation runs are independent of each other, so they can be treated as different simulator executions. This independence allows the use of general-purpose resource sharing tools to manage simulations in complex heterogeneous distributed environments.

This approach makes no use of the features of current simulators, where a restart of the simulation model is possible, allowing several simulations to be carried out without the OS overhead of finishing the simulator program and executing it again. General-purpose tools lose a simulation speed-up opportunity here, and if it is sufficiently large, it can justify the use of a specialised simulation management tool.

A specialised tool is needed to generate the experiment campaign data anyway, with enough flexibility to let the user express the analysis that must be carried out. It must either manage the simulations itself or be easily coupled with a general-purpose distributed simulation environment if such a system is used.

Our research group has already developed such a tool, called VFIT [3]. VFIT is a powerful and mature fault injection tool for VHDL models but lacks the distributed simulation capabilities mentioned above, so we must resort to a general-purpose tool for the simulation workload distribution.

The software toolkit presented here addresses these simulation needs, incorporating a simulation framework that uses simulator commands to inject faults into the system model and to speed up simulations. We have called this set of software elements the FIASCO toolkit (Fault Injection Aid Software COmponents).

The goal of the FIASCO toolkit and the experiments presented here is to determine whether noticeable performance improvements are possible with a specialised simulation framework, before including this feature in the next release of VFIT. As the work presented here is based on an interpreted language, it is expected that the compiled nature of VFIT will further increase the performance obtained with FIASCO.


Figure 1: FIASCO toolkit component interconnection.

2. The FIASCO toolkit

Modelsim [4] is a very popular HDL simulator from a major EDA company that runs on a large variety of system architectures. A Tcl/Tk interpreter is embedded into the simulator, which lets the user extend it with scripts that can be dynamically loaded and executed. These scripts can add new commands or modify existing ones. The simulator provides a restart command to speed up a set of simulations from a single program execution. These capabilities are exploited by FIASCO to graphically assist users in their dependability research and to manage simulations in a distributed environment.

The FIASCO toolkit consists of two scripts (see Fig. 1). The client (called FIASCO-C) generates the campaign data, performs the statistical analysis and controls the simulation in the distributed environment. The server (FIASCO-S) executes within the Modelsim simulator in each simulation host. It receives simulation requests from the client and uses the simulator commands to carry them out. The FIASCO-C script incorporates the experience gained and the technologies developed during the development of our group's tool VFIT.

Sockets are used as the communication channel between the client and the simulation servers. To tolerate network failures, the client/server protocol is used only to issue requests and to signal when those requests have been accomplished. A simulation request is a request to inject a configurable number of faults into the system model using the restart command. The result of each simulation is stored locally on the server hard disk. Result files are retrieved by the client using the standard ftp protocol when the network is functioning.

If the network fails for some reason, the connection socket between server and client closes, but the server continues simulating until all the requested faults have been injected. If it cannot signal the end of the request to the client, it simply waits for the client to reconnect. When the client connects to a simulation server, it first requests the server status to determine whether the server is idle, so the protocol on the client side can resynchronise accordingly. If the server is idle, the client issues a new simulation request in a short message and opens an ftp connection to retrieve result files.

With this arrangement, the impact of intermittent network failures is minimised, as the client sends simulation requests (very short messages, minimal network utilisation) even if result files cannot be transferred.
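The resynchronisation logic can be sketched without any real networking. The classes and function below are hypothetical stand-ins (the actual client and server are simulator scripts using sockets and ftp); the sketch only shows the order of operations described above: ask for status first, dispatch a new request only to an idle server, and collect result files when the link is available.

```python
from dataclasses import dataclass, field


@dataclass
class SimServer:
    """Stand-in for a simulation server; it keeps working while disconnected."""
    pending: int = 0                      # faults still to be injected
    results: list = field(default_factory=list)

    def status(self):
        return "idle" if self.pending == 0 else "busy"

    def request(self, n_faults):
        if self.pending:
            raise RuntimeError("server is busy")
        self.pending = n_faults

    def work(self, steps=1):
        """Inject up to `steps` pending faults, storing each result locally."""
        for _ in range(min(steps, self.pending)):
            self.results.append(f"result-{len(self.results)}")
            self.pending -= 1

    def fetch_results(self):
        """Hand over the stored result files (ftp transfer in the real tool)."""
        files, self.results = self.results, []
        return files


def client_visit(server, n_faults, network_up=True):
    """One client visit: resynchronise, then dispatch and collect if idle."""
    if not network_up:
        return "unreachable", []          # the server simply keeps simulating
    if server.status() != "idle":
        return "busy", []                 # try again later
    server.request(n_faults)              # short request message
    return "dispatched", server.fetch_results()


server = SimServer()
print(client_visit(server, 3))                     # first request, nothing to fetch
server.work(2)
print(client_visit(server, 3, network_up=False))   # link down, server unaffected
server.work(5)                                     # finishes while disconnected
print(client_visit(server, 2))                     # reconnect: results retrieved
```

Because the server stores results locally and only short messages are needed to keep it busy, a lost connection delays result collection but does not stop the simulations.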

To ensure interoperability between client and servers of different architectures, result files are plain text files. However, as text-based trace files tend to be very large due to the file format the simulator uses, we have developed a new format. This is still plain text, but it achieves a reduction ratio between 19 and 26 (depending on the trace itself). The final file is then compressed, for a total compression ratio above 99 % with respect to the original text files.

3. Test description and experimental results

To test the FIASCO toolkit we have used a complex SoC system. It comprises a MIPS R3000 RISC processor, an instruction cache and a set of interconnection elements for the internal bus that connects the SoC with the external memories.

We have added a trace facility inside the processor's model, and a VHDL testbench to incorporate the external memories into the model. The program space is filled at start-up time with the Eratosthenes sieve prime number generator program inside an infinite loop. This allows us to arbitrarily change the number of simulated system clock cycles.

We have carried out a simple experiment in order to estimate our tool's performance, studying the simulation time for a single simulation server for different testbench configurations. This server runs SuSE Linux 6.4 on a mid-range PC and is connected to the client through a 100 Mbps LAN. To study the simulation time for different trace file sizes, we simply change the number of clock cycles to be simulated.

Let T_C denote the mean time to execute the simulator program for a single injection of C clock cycles, and n the number of injections performed. With these definitions, a lower bound for the simulation time using a general-purpose tool is simply n × T_C.

This bound does not take into account the transfer time for the result files, whereas that time is included in the times measured for the FIASCO toolkit. However, as the final file to be transferred (after compression) is relatively small, the measured transfer time is negligible.

The study ranges n from 1 to 50 and sets C to 100, 500 and 1000 clock cycles. Results use n × T_C as a reference to obtain the relative percentage of performance improvement.

Simulation times are shown in Table 1. The total time is the actual measurement and the mean simulation time is derived from this value and the number of injections. The relative improvement is the reduction of the total simulation time with respect to the reference used for a general-purpose tool, n × T_C.


Table 1: Total and mean simulation times (in seconds), relative performance improvement.

       Total simulation time        Mean simulation time        Relative improvement
       (simulated clock cycles)     (simulated clock cycles)    (simulated clock cycles)
  n       100      500     1000       100      500     1000        100      500     1000
  1     58.35   101.49   179.69     58.35   101.49   179.69     0.00 %   0.00 %   0.00 %
  2     75.29   180.45   337.94     37.64    90.23   168.97    35.48 %  11.10 %   5.97 %
  5    141.32   419.00   808.49     28.26    83.80   161.70    51.56 %  17.43 %  10.02 %
 10    244.91   825.30  1615.13     24.49    82.53   161.51    79.01 %  18.68 %  10.12 %
 50   1156.73  4033.00  7984.95     23.13    80.66   159.70    86.78 %  20.53 %  11.13 %

Results clearly show the benefits of using a specialised simulation tool like FIASCO to manage simulations. The improvement grows with the number of simulations (n) to be carried out in the server host and decreases as the time needed for a single stand-alone simulation (T_C) grows. This means that small to medium size installations get more benefit per host (as a single host has more simulations to do). Results also show that the benefit is greater for small to medium models (as larger models increase T_C).
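The relative improvement can be recomputed from the total times. The formula below is inferred from the entries of Table 1 rather than quoted from the text (it reproduces, for instance, the n = 2 row); the function name is an illustrative choice.

```python
def relative_improvement(total_time, n, t_c):
    """Fraction of the reference time n * T_C saved by the measured run."""
    reference = n * t_c            # lower bound for a general-purpose tool
    return (reference - total_time) / reference


# n = 2 injections of 100 clock cycles: T_C = 58.35 s, total = 75.29 s
print(f"{100 * relative_improvement(75.29, 2, 58.35):.2f} %")

# n = 5 injections of 500 clock cycles: T_C = 101.49 s, total = 419.00 s
print(f"{100 * relative_improvement(419.00, 5, 101.49):.2f} %")
```

For n = 1 the measured time is the reference itself, so the improvement is zero by construction.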

4. Conclusions

In this paper, a specialised simulation framework for dependability assessment using fault injection into VHDL models has been presented. The performance improvement obtained with our FIASCO toolkit in a simple experiment has also been presented, and the results are very interesting.

FIASCO is especially well suited for fault injection experiments needing a large number of short simulations. The performance benefit ranges from 35 % to 86 % per simulation, depending on the total number of simulations a single host must perform. These figures prove the usefulness of such a tool in the dependability assessment field.

The simulation techniques used in this work are not applicable to all simulation cases, however. There are simulation experiments in the field of fault injection into VHDL models where the use of the tool delivers only a moderate improvement. In our current development, we are trying to raise the percentage improvement for long simulations. This would make FIASCO a useful tool for a wider range of simulation studies.

Acknowledgements

This work is partially supported by the Spanish Government's Comisión Interministerial de Ciencia y Tecnología under project CICYT TAP99-0443-C05-02.

Bibliography

[1] Avresky, D., Grosspietsch, K.E., Johnson, D.W., and Lombardi, F.: Embedded Fault Tolerant Systems. IEEE Micro Magazine, pp. 8-11, Vol. 18, No. 5, 1998.

[2] Arlat, J., Aguera, M., Amat, L., Crouzet, Y., Fabre, J. C., Laprie, J. C., Martins, E., Powell, D.: Fault injection for dependability validation: A methodology and some applications. IEEE Transactions on Software Engineering, pp. 166-182, vol. 16, 1990.

[3] Baraza, J. C., Gracia, J., Gil, D., Gil, P. J.: A Prototype of a VHDL-Based Fault Injection Tool. Description and Application. Journal of Systems Architecture, special issue on Defect and Fault Tolerance in VLSI Systems, to appear.

[4] Modelsim SE v5.5b Command Reference. Model Technology Inc., 2001.

Part III:

Conclusions

Chapter 13
Conclusions

All generalizations are false, including this one.
Mark Twain

This chapter summarises the contributions of this work and the publications it has given rise to, and finally outlines some of the research lines that remain open and may guide future work.

13.1. Introduction

Complemented with mechanisms for masking or detecting data errors, a watchdog processor is a viable alternative, and a less complex one, to the pure spatial redundancy required in dual systems.

Detecting control-flow errors with a watchdog processor gives a high level of confidence that the processor executes the instructions requested of it, in the order requested. This opens the door to software-based mechanisms for detecting data errors, which undoubtedly leads to significant cost savings.

However, for a mechanism of this kind to be used in practice, a number of conditions must be taken into account:

The architecture of the monitored processor must be modified as little as possible, so that the watchdog processor can be included as an added safety element without a complex redesign of the processor or its instruction set.

The use of the watchdog processor must be as transparent as possible to the end user, the programmer. If the programmer needs highly specific knowledge of the error detection mechanisms built into the system in order to use them effectively, we can conclude that they will be discarded, under-used or, even worse, used incorrectly, creating false confidence in the behaviour of the system in the presence of errors.

Any error detection mechanism added to a system that originally lacked one incurs additional costs, whether in the design of the system, in the complexity of the resulting circuit, in the memory required to include it, or in the resulting performance. In this last case, the performance loss may come from two different sources: i) the processor needs additional cycles to execute the program once the error detection mechanism is included, or ii) the mechanism, being inserted as an inherent part of instruction execution, lies on a timing-critical path and forces a reduction of the system clock frequency.

In summary, the ideal characteristics of an error detection mechanism (besides its parameters as such a mechanism: maximum detection coverage with minimum latency) would be the following:

Not modifying the original instruction set of the monitored processor.

Not modifying the architecture of the monitored processor.

Not requiring additional logic or memory.

Not affecting execution time, so that performance is the same before and after its inclusion.


Additionally, other desirable characteristics are: i) that its deployment does not depend on specific features of the monitored processor, and ii) that it maintains maximum compatibility with existing binary code, even at the expense of its primary goal, error detection.

13.2. Contributions

To carry out the work presented here, the different proposals for including mechanisms that detect errors in the execution control flow of a processor, both software and hardware, have been analysed.

The most novel contribution (and the core of this work) is the proposal of a new technique for embedding derived signatures in the instruction space of the monitored processor.

This technique, called ISIS (an acronym for Interleaved Signatures Instruction Stream), does not depend on any specific feature of the architecture or instruction set of the monitored processor, so it can be used on any architecture.

The signatures are interleaved (hence the name of the technique) between the basic blocks of sequential execution of the original program. Verifying the instructions executed by the main processor requires no modification of the processor's instruction set, which allows full binary compatibility with pre-existing software. The fundamental difference from previous proposals in how signatures are embedded, which results in an appreciable performance improvement, is that these signatures go completely unnoticed by the main processor thanks to the change introduced in the semantics of conditional branch instructions.

After executing an instruction, the main processor updates the program counter to execute the instruction stored next. The semantic change mentioned above is the following: if, while executing a conditional branch instruction, the processor determines that the branch condition does not hold (and that execution must therefore continue in the usual sequential order), it updates the program counter so that it "skips" one position, avoiding the fetch and execution of one memory word; the gap thus created is used to insert the block signature.
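The modified program counter update can be sketched as follows. This is a simplified illustration, not the HORUS implementation: it assumes 4-byte instruction words, places the signature in the word right after the conditional branch, and ignores details such as branch delay slots. The function and parameter names are hypothetical.

```python
WORD = 4  # bytes per instruction word (assumption: 32-bit fixed-length ISA)


def next_pc(pc, is_conditional_branch=False, taken=False, target=None):
    """Next program counter value under the modified branch semantics."""
    if is_conditional_branch and taken:
        return target              # usual behaviour: jump to the branch target
    if is_conditional_branch:
        return pc + 2 * WORD       # not taken: skip the embedded signature word
    return pc + WORD               # any other instruction: plain sequential flow


print(hex(next_pc(0x100)))                                   # ordinary instruction
print(hex(next_pc(0x100, is_conditional_branch=True)))       # branch not taken
print(hex(next_pc(0x100, True, taken=True, target=0x200)))   # branch taken
```

Since the main processor never fetches the skipped word, the signature costs memory space but no execution cycles on the fall-through path.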


To allow the watchdog processor to access that word at the same time as the main processor fetches instructions, the instruction cache must have two access ports, one for the main processor and one for the watchdog.

Nor is the programmer required to change the source code of programs to use this error detection mechanism effectively. In the practical implementation of this signature insertion technique on a RISC processor of the MIPS family, the software development tools have been modified so that the system embeds the signatures automatically simply by adding one extra compilation option.

Besides proposing an original signature embedding technique, and as the second contribution of this thesis, the mechanisms that verify the structural integrity of the program executed by the main processor include a proposal that verifies a jump with multiple destinations (all known at link time) in a simple and elegant way.

A jump with multiple destinations is verified by performing the check from the block that is reached. This verification, called SAC (Source Address Checking), is achieved by storing in each possible destination a checksum of the displacement of the jump just executed. It is the first time that a verification performed after the jump has been proposed in the literature. In previous proposals, when there are multiple destinations and the jump is to be verified, the verification is achieved either i) by using justifying signatures between the most frequent path and the other possible destinations, or ii) by using some kind of table in the source block containing all possible destinations.

The originality of SAC verification is that it uses a single field (in each of the destination nodes involved) to perform, after the jump, a single check that verifies that the jump came from the correct block.
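The idea behind SAC can be sketched in a few lines. The text only states that each destination stores a checksum of the jump displacement; the folding function, the 8-bit field width and the way the watchdog obtains the two addresses are assumptions made purely for illustration, as are all names below.

```python
def checksum(displacement, bits=8):
    """Fold a 32-bit displacement into `bits` bits by XOR (illustrative choice)."""
    mask = (1 << bits) - 1
    value = displacement & 0xFFFFFFFF      # treat negative displacements as 32-bit
    folded = 0
    while value:
        folded ^= value & mask
        value >>= bits
    return folded


def sac_field(source, destination):
    """Value stored in the destination block when signatures are generated."""
    return checksum(destination - source)


def sac_check(previous_pc, current_pc, stored_field):
    """Single check performed after the jump, from the block that was reached."""
    return checksum(current_pc - previous_pc) == stored_field


# One jump at 0x1000 with three legal destinations (hypothetical addresses).
source = 0x1000
fields = {dest: sac_field(source, dest) for dest in (0x1040, 0x1080, 0x2000)}

print(all(sac_check(source, dest, f) for dest, f in fields.items()))  # legal jumps
print(sac_check(0x1004, 0x1040, fields[0x1040]))                      # wrong origin
```

Each destination needs only its own field, so no table of destinations has to be kept in the source block.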

The practical development of the above proposals has led to the third contribution of this thesis: a synthesisable model of the system written in the VHDL hardware description language which, together with the MIPS-architecture main processor, includes a watchdog processor, an instruction cache, an AMBA on-chip bus, an external memory interface, and everything needed to use it in a real system on a programmable logic device.

This system has been named HORUS and is the most important practical contribution of the work described in this dissertation.

The structure of HORUS also allows the execution of processes that carry no signatures for verifying the integrity of the instructions executed by the processor. Error detection can be enabled and disabled depending on the running task, which allows an incremental migration towards including signatures in the different tasks of a system and full compatibility with existing software.

HORUS allows the usual use of interrupts and exceptions in the system, which was not possible in other watchdog processor proposals. It must be borne in mind, however, that detection latency is negatively affected if, during the execution of a block of instructions containing an error and before the block finishes, the processor switches to an interrupt or exception handler, because detection does not take place until the main processor executes the last instruction of that block.

An interesting feature of the internal structure of HORUS is that the performance loss does not exceed 6% of the execution time in any of the tests carried out, and in many cases stays below 2%. Unfortunately, it shares with other watchdog processor proposals memory requirements that can increase the memory space needed by programs by up to 30%.

HORUS is intended to demonstrate the practical feasibility of the proposals put forward with ISIS, so that the work has an immediate application. This development also settles the doubts (which arise quite naturally) about whether signatures can be incorporated into source code in a completely automatic way, without the programmer having to make any modification. These doubts are settled by the fourth contribution of this thesis, which consists of modifying well-known applications so that the automatic generation and insertion of reference signatures becomes a reality: the set of utilities known as binutils that accompanies GNU's gcc compiler, mainly the gas assembler, the ld linker/loader and the bfd library, which handles the object code format (in our case the elf format, Executable and Linkable Format). This development environment was chosen for several reasons, notably:


1. Being an open-source development, any modification can be incorporated without major problems.

2. Given the way the compile/link process itself is structured, the changes made to the assembler, linker and base library apply automatically to any high-level language supported by the gcc compiler.

3. The GNU tools are widely recognised for their versatility and their use in developing both general-purpose and industrial applications, and are used every day to develop real applications. This is therefore not a merely academic tool, which dispels possible doubts about its use in real systems.

As a final contribution, alongside the development of HORUS a set of software routines has been created for injecting faults into the VHDL model of the system, in order to verify the functionality of the watchdog processor. This set of utilities, written entirely in the tcl scripting language, has been named FIASCO (an acronym for Fault Injection Aid Software COmponents).

13.3. Conclusions

From the set of fault injection experiments carried out to verify the functionality of the system we can only conclude that the number of faults that would have to be injected to characterise the error detection mechanisms is enormous.

Although a few thousand faults were injected, randomly generated and uniformly distributed over the whole system, all those that produced a failure of the kinds the watchdog processor is prepared to detect were detected by some mechanism, either of the watchdog processor (integrity of the block length, of the executed instructions, of the jump taken) or of the original system (for example, illegal memory accesses detected by the processor's MMU).

Since it is not possible to develop a detection mechanism with 100% coverage, and since it is easy to imagine some case that the watchdog processor cannot detect, it follows simply that, for a fault injection campaign generated randomly and uniformly over the whole system to contain the critical cases that bring the coverage down from the highly improbable 100%, several tens or even hundreds of thousands of faults must be injected into the system.
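A rough estimate of the campaign size follows from a simple probabilistic argument. Assuming independent injections and a hypothetical probability p that a randomly chosen fault is one of the undetectable critical cases, the probability of observing at least one such case in N injections is 1 - (1 - p)^N. The value of p used below is purely illustrative, not a measured figure.

```python
import math


def injections_needed(p_escape, confidence=0.95):
    """Smallest N such that 1 - (1 - p_escape)**N >= confidence."""
    if not 0.0 < p_escape < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p_escape and confidence must lie strictly in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_escape))


# Illustrative assumption: one fault in 100 000 is an undetected critical case.
n = injections_needed(1e-5)
print(n)   # on the order of 3 * 10**5 injections for 95 % confidence
```

Under this assumption a campaign of a few thousand faults would almost never contain a critical case, which is consistent with the observation that every failure produced so far was detected.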

Por último, del análisis de prestaciones y consumo de memoria del sistemaresultante podemos concluir que:

1. One cause of the performance loss of the main processor is the interference with the operation of the instruction cache caused by the watchdog processor's memory accesses to fetch the signatures.

The watchdog processor monitors the instructions entering the main processor's pipeline not only to check for errors but also to determine where the signature it must use as a reference is located. It requests that signature as soon as the main processor starts executing a basic sequential block, and the signature is stored in the memory location immediately preceding the first instruction of that block. It follows that the very properties that give a cache good performance (spatial and temporal locality) also help keep the watchdog processor's interference small.

In short, the watchdog processor's interference with the instruction cache is small because the locations it requests coincide in space and time with the instructions the main processor is executing.

2. Another cause of the performance loss of the main processor is that, although it neither executes nor needs the block signatures, in some cases new instructions that were not in the original program must be inserted into the main processor's program.

These instructions must be inserted when, for example, the number of instructions in a sequential block exceeds the maximum that the length field of the associated signature can represent. In that case, the signature generation system inserts a jump instruction to force the block to be split into several smaller sequential blocks. These jump instructions are fetched from memory and executed by the main processor, and since they did not exist in the original program, the time spent executing them must be counted as time lost by the main processor.

This performance penalty can be reduced simply by increasing the maximum length allowed for a sequential block, which the ISIS proposal sets at 16 instructions. The implications of this increase for the resulting system must be taken into account, however:

a) Enlarging the length field in a block's signature requires shrinking one of the other fields devoted to error detection. It will therefore be necessary to evaluate how that reduction affects the coverage of the mechanism concerned.

b) Allowing longer sequential blocks means that, in the worst case, more time elapses (the time needed to execute the additional instructions) between the occurrence of an error and its detection; that is, detection latency is adversely affected.

c) The probability of detecting (the coverage of) an error in the binary code of the instructions is also adversely affected; an example of such an error is an addition instruction turning into a subtraction because its opcode is corrupted. This happens because, for a fixed length of the check code, the probability of detecting the fault decreases as the number of instructions covered increases. The effect is much smaller in this case, however, given the excellent detection properties of the cyclic redundancy codes used.

3. The memory required to use the ISIS proposals depends heavily on the structure of the program to be verified.

On the one hand, sequential blocks that are too long for the watchdog processor force the insertion, at assembly time, of jump instructions that split them into smaller blocks.

On the other hand, a large number of small blocks leads to a very high relative increase in memory consumption, since every block (whatever its length) requires a signature (which occupies the same memory space as one main processor instruction). Clearly, the relative memory increase from inserting the signature of a 4-instruction block is 25%, but only 12.5% if the block has 8 instructions.

Unless the compiler's translation of high-level code is changed so that it tends to generate medium-sized blocks (if that is possible at all), this memory overhead can be reduced by:

a) Reducing the size of a sequential block's signature, provided that the reduction actually saves memory locations. This is not the case for the MIPS architecture: every instruction is 4 bytes long, and halving the signature would leave a 2-byte gap that cannot be used (instructions must remain aligned on addresses that are multiples of 4). It would in principle be possible on other architectures, however, such as MIPS16, which uses 2-byte instructions aligned on even addresses.

b) Eliminating the signature of a sequential block as a piece of information separate from the main program, so that it is "integrated", for example, into the unused gaps of the main processor's instructions.
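The trade-offs in points 2 and 3 can be made concrete with a small calculation. The sketch below assumes, as the text states, one signature per block occupying the same space as one instruction, and a maximum block length of 16 instructions. It further assumes, for illustration only, that each split costs exactly one inserted jump and that this jump counts toward the length of the piece it terminates. The function names are hypothetical and do not correspond to the modified binutils.

```python
MAX_BLOCK_LEN = 16  # maximum block length in the ISIS proposal

def split_block(n_instr, max_len=MAX_BLOCK_LEN):
    """Lengths of the pieces a block of n_instr original instructions is
    split into. Every piece except the last ends with an inserted jump,
    which counts toward that piece's length (illustrative assumption)."""
    if n_instr < 1 or max_len < 2:
        raise ValueError("need n_instr >= 1 and max_len >= 2")
    pieces = []
    remaining = n_instr
    while remaining > max_len:
        pieces.append(max_len)       # max_len - 1 original instructions + 1 jump
        remaining -= max_len - 1
    pieces.append(remaining)
    return pieces

def memory_overhead(block_lengths, max_len=MAX_BLOCK_LEN):
    """Relative memory increase: (signatures + inserted jumps) divided by
    the number of original instructions."""
    original = sum(block_lengths)
    extra = 0
    for n in block_lengths:
        pieces = split_block(n, max_len)
        extra += len(pieces)         # one signature per resulting block
        extra += len(pieces) - 1     # one inserted jump per split
    return extra / original

print(memory_overhead([4]))   # 0.25, the 25% case in the text
print(memory_overhead([8]))   # 0.125, the 12.5% case in the text
print(split_block(40))        # [16, 16, 10] under the assumptions above
```

Calling `memory_overhead` with the block-length histogram of a real program, and with different values of `max_len`, gives a rough idea of how the two effects (many short blocks versus forced splits of long ones) combine.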

13.4. Publications directly related to the thesis work

The signature embedding proposal and the set of ISIS detection mechanisms, together with the description of the architecture of the HORUS processor that implements them, were published in

F. Rodríguez, J.C. Campelo, J.J. Serrano. A Watchdog Processor Architecture with Minimal Performance Overhead. Proc. SAFECOMP'02, Lecture Notes in Computer Science, vol. 2434, pp. 261-272, Catania, Italy, 2002

F. Rodríguez, J.C. Campelo, J.J. Serrano. The HORUS Processor. Proc. XVII Conference on Design of Circuits and Integrated Systems (DCIS'2002), Santander, Spain, 2002


The results of the performance and memory consumption analysis were published in the papers listed below; the last of them already points to possible solutions for reducing memory consumption without degrading performance.

F. Rodríguez, J.C. Campelo, J.J. Serrano. Delivering Error Detection Capabilities into a Field Programmable Device: The HORUS Processor Case Study. Proc. 2002 IEEE International Conference on Field-Programmable Technology (FPT'2002), Hong Kong, China, 2002

F. Rodríguez, J.C. Campelo, J.J. Serrano. A Memory Overhead Evaluation of the Interleaved Signature Instruction Stream. Proc. 17th IEEE Int. Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'02), Vancouver, Canada, 2002

F. Rodríguez, J.C. Campelo, J.J. Serrano. Improving the Interleaved Signature Instruction Stream Technique. Proc. WSEAS Intl. Conference (ICAI'2002), Santa Cruz de Tenerife, Spain, 2002

F. Rodríguez, J.C. Campelo, J.J. Serrano. Improving the Interleaved Signature Instruction Stream Technique. Proc. IEEE Canadian Conference on Electrical and Computer Engineering (CCECE'2003), Montreal, Canada, 2003

The following paper describes the modifications to the binutils suite for automatic signature generation, as well as the experiments carried out to obtain the detection coverage of the error detection mechanisms based on the jump address.

F. Rodríguez, J.J. Serrano. Control Flow Error Checking with ISIS. Proc. Intl. Conference on Embedded Systems and Software (ICESS'2005), Lecture Notes in Computer Science, vol. 3820, pp. 659-670, Xi'an, China, 2005

Finally, the papers published on the FIASCO tool and on fault injection into the VHDL model of the HORUS system, using a distributed system to reduce the duration of the experiment, were

F. Rodríguez, J.C. Campelo, J.J. Serrano. Reducing the VHDL-Based Fault Injection Simulation Time in a Distributed Environment. Informal Digest Proc. 7th IEEE European Test Workshop (ETW'02), Corfu, Greece, 2002

F. Rodríguez, J.C. Campelo, J.J. Serrano. A Distributed Simulation Environment for Fault Injection Analysis on SoC Models. Proc. 5th IEEE Design and Diagnostics of Electronic Circuits and Systems Intl. Workshop (DDECS'2002), Brno, Czech Republic, 2002

13.5. Future work

The following lines of future work are proposed:

A first line of work is the characterization of the watchdog processor implemented in the HORUS system, in order to determine experimentally its error detection coverage and its detection latency.

The aim would not be merely to determine these parameters, but to carry out a broader study of whether the number of bits required by the node signature can be reduced (while keeping detection coverage high, of course) so that more bits can be given to the fields used to verify jumps.

The goal is to find a balance among the bits devoted to each kind of check so as to achieve high detection coverage in all of them without significantly harming any.

Another natural continuation of this work is to add error recovery mechanisms to the system. In the current system, when an error is detected the watchdog processor asserts an error output line. This line can be connected to an internal signal to raise a software exception in the processor, or to an external signal (to force a system reset, for example). However, no further work has been done on how the system should handle the detected error.

From the study of the different signature insertion techniques it can be concluded that some cases remain in the execution structures that cannot be covered by any detection mechanism.

It is true that the proposal presented in this work adds jumps with multiple destinations (which previously could not be verified) to the list of detectable situations, by including a new type of check in which, once the destination of the jump has been reached, the origin of the jump is verified. But these multiple-destination jumps can be verified only if, for each possible destination, the origin is unique.

However, when a block is reached from two or more origin blocks by means of a multiple-destination jump, the destination block cannot include the origin-node check (since the origin must be unique). It is therefore necessary to analyze why this kind of structure arises and how to incorporate some mechanism that allows the jump taken by the processor to be verified, which would enlarge the set of verifiable jumps.

In this regard, a line that looks promising is the replication of these blocks or the automatic generation of bridge blocks.
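The restriction described above can be checked mechanically on a control-flow graph. The sketch below is an illustration, not part of the thesis toolchain: it assumes a hypothetical representation in which `multi_jumps` maps each block ending in a multiple-destination jump to the set of blocks it may reach, and it reports which destinations can carry an origin check (exactly one origin) and which cannot.

```python
from collections import defaultdict

def classify_destinations(multi_jumps):
    """multi_jumps: dict mapping an origin block name to the set of blocks
    its multiple-destination jump may reach (hypothetical representation).
    Returns (verifiable, conflicting): destinations with exactly one origin,
    and destinations shared by two or more origins."""
    origins_of = defaultdict(set)
    for origin, destinations in multi_jumps.items():
        for dest in destinations:
            origins_of[dest].add(origin)
    verifiable = {d: next(iter(o)) for d, o in origins_of.items() if len(o) == 1}
    conflicting = {d: sorted(o) for d, o in origins_of.items() if len(o) > 1}
    return verifiable, conflicting

# Two switch-like blocks that share the destination "exit" (made-up example).
jumps = {
    "switch_a": {"case1", "case2", "exit"},
    "switch_b": {"case3", "exit"},
}
verifiable, conflicting = classify_destinations(jumps)
print("origin check possible:", verifiable)
print("origin not unique:", conflicting)
```

Destinations reported as conflicting are the candidates for the block replication or bridge blocks mentioned above: giving each origin its own copy of the block, or its own intermediate block, would in principle restore a unique origin.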

Finally, it would be desirable to have an error detection mechanism with less demanding memory requirements. To that end, some of the published papers mentioned in the previous section already propose embedding the block signatures in the unused fields of the instructions. For blocks whose original instructions do not offer enough gaps, no-operation instructions would have to be added to hold the remaining bits.

In this case it is necessary to evaluate the increase in complexity of the watchdog processor, which would now have to filter the main processor's instructions to extract the block signatures from them, and whether that increase is offset by the simplification of the instruction cache, which would need only one access port (for the main processor).

It will also be necessary to study the instruction sets of the most relevant architectures to determine whether they contain enough gaps for this modification of the signature insertion technique to be of practical use.
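The embedding idea can be sketched as a simple bit-budget calculation. The code below assumes a 32-bit signature, consistent with a signature that occupies the space of one 4-byte instruction, and a hypothetical list giving the number of unused bits in each instruction of a block (real figures depend on the instruction set and would come from the study mentioned above). The number of free bits offered by an added no-operation instruction, `NOP_FREE_BITS`, is likewise an illustrative assumption.

```python
import math

SIGNATURE_BITS = 32   # a signature currently occupies one 4-byte word
NOP_FREE_BITS = 26    # hypothetical: free bits offered by one added NOP

def nops_needed(free_bits_per_instr, signature_bits=SIGNATURE_BITS,
                nop_free_bits=NOP_FREE_BITS):
    """Number of NOPs to add so that the unused bits of a block's
    instructions, plus those of the NOPs, can hold the whole signature."""
    available = sum(free_bits_per_instr)
    missing = max(0, signature_bits - available)
    return math.ceil(missing / nop_free_bits)

print(nops_needed([5, 5, 0, 10, 5, 5, 5]))  # 35 free bits, no NOP needed: 0
print(nops_needed([5, 0, 0, 5]))            # 22 bits missing: 1
print(nops_needed([0, 0]))                  # 32 bits missing: 2
```

Under these assumptions, a block that needs more than one NOP would use more memory than the separate signature word it replaces, which is one reason the instruction-set study matters.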


Bibliografía

[1] María Eulalia Fuentes i Pujol. Documentación cientí�ca e información:Metodología del trabajo intelectual y cientí�co. Escuela Superior de Re-laciones Públicas: Promociones y Publicaciones Universitarias SP, 1992.Barcelona, España.

[2] A. Avizienis. Building dependable systems: How to keep up with com-plexity. In Proceedings of the 25th Fault Tolerant Computing Symposium(FTCS-25), pages 4�14, 1995. Pasadena, California.

[3] U. Gunne�o, J. Karlsson, and J. Torin. Evaluation of error detectionschemes using fault injection by heavy-ion radiation. In Proceedings ofthe 19th Fault Tolerant Computing Symposium (FTCS-19), pages 340�347, 1989. Chicago, Illinois.

[4] E. W. Czeck and D. P. Siewieorek. E�ects of transient gate-level faults onprogram behavior. In Proceedings of the 20th Fault Tolerant ComputingSymposium (FTCS-20), pages 236�243, 1990. NewCastle Upon Tyne,U.K.

[5] J. Gaisler. Evaluation of a 32-bit microprocessor with built-in concurrenterror detection. In Proceedings of the 27th Fault Tolerant ComputingSymposium (FTCS-27), pages 42�46, 1997. Seattle, Washington.

[6] J. Ohlsson, M Rimén, and U. Gunne�o. A study of the e�ects of tran-sient fault injection into a 32-bit RISC with built-in watchdog. In Pro-ceedings of the 22th Fault Tolerant Computing Symposium (FTCS-22),pages 316�325, 1992. Boston, Massachusetts.

[7] D. P. Siewiorek. Niche sucesses to ubiquitous invisibility: Fault-tolerantcomputing past, present, and future. In Proceedings of the 25th Fault To-

161

lerant Computing Symposium (FTCS-25), pages 26�33, 1995. Pasadena,California.

[8] R. K. Iyer N. Nakka, Z. Kalbarczyk and J. Xu. An architectural fra-mework for providing reliability and security support. In Proceedings ofthe 2004 International Conference on Dependable Systems and Networks(DSN-2004), pages 585�594, 2004. Florence, Italy.

[9] J. Gaisler. Concurrent error-detection and modular fault-tolerance inan 32-bit processing core for embedded space �ight applications. InProceedings of the 24th Fault Tolerant Computing Symposium (FTCS-24), pages 128�130, 1994. Austin, Texas.

[10] J. L. Hennessy and D. A. Patterson. Computer Architecture. A Quan-titative Approach. Morgan-Kau�mann Publisher Inc., second edition,1996.

[11] N. Oh, P. P. Shirvani, and E. J. McCluskey. Control �ow checking bysoftware signatures. IEEE Transactions on Reliability Special Sectionon Fault Tolerant VLSI Systems, 51(2), March 2002.

162
