
José M. Pavía: Universitat de València | [email protected]
Beatriz Larraz: Universidad de Castilla-La Mancha | [email protected]

Reis 137, January-March 2012, pp. 237-264

Nonresponse Bias and Superpopulation Models in Electoral Polls

Sesgo de no-respuesta y modelos de superpoblación en encuestas electorales

José M. Pavía and Beatriz Larraz

doi:10.5477/cis/reis.137.237

Key words
Electoral forecasts • Voting behaviour • Recall vote • Cluster Sampling • Pre-election polls • Monte Carlo simulation • Spanish elections

Abstract
Nonresponse bias (and, to a lesser extent, measurement error) has become the main source of error for electoral forecasts in Spain. Although the post-stratification techniques and ratio estimators currently used in the polling industry reduce deviations, they do not show enough capacity to mend the biases introduced when collecting data. This research reveals how a more efficient use of the electoral information available outside the sample could help to significantly improve the accuracy of predictions, and uses simulation techniques to show that this may be accompanied by less expensive sampling designs. The analysis, nevertheless, also concludes that the proposed specification is not a panacea and affirms that there is still scope for reducing nonresponse bias, pointing to several issues for future research.

Palabras clave
Predicciones electorales • Comportamiento electoral • Recuerdo de voto • Muestreo por conglomerados • Encuestas preelectorales • Simulación de Monte Carlo • Elecciones españolas

Resumen
El sesgo de no-respuesta (y, en menor medida, el error de respuesta) se ha convertido en la principal fuente de error de las predicciones electorales en España. Las técnicas de post-estratificación y los estimadores ratio utilizados actualmente por la industria demoscópica no muestran una capacidad suficiente para corregir los sesgos introducidos durante la recogida de datos. Este trabajo revela cómo un uso más eficiente de la información electoral extramuestral disponible permitiría mejorar sensiblemente la precisión de las estimaciones y muestra, utilizando técnicas de simulación, que ello podría venir acompañado de diseños muestrales más baratos. El estudio, no obstante, concluye que la especificación utilizada en esta investigación no constituye una panacea y señala que existe todavía margen para la corrección del sesgo de no-respuesta, apuntando diversas posibilidades de investigación futura.

INTRODUCTION¹

Concern about the potential consequences of nonresponse² in survey research is as old as the discipline itself. According to Smith (1999), "Early research extends back to the emergence of polling in the 1930s and has been a regular feature in statistical and social science journals since the 1940s". Despite this, only more recently has the scientific community devoted more attention to the issue, in response to the consequences that the increasing unwillingness of many citizens to participate in polls is having on the quality of survey outcomes (Singer, 2006).

Since the second half of the 1980s, there has been a constant and gradual increase in survey nonresponse rates, with non-cooperation of respondents established as the main cause of this trend (de Leeuw and de Heer, 2002). Researchers' efforts promptly focused on trying to understand the causes of the phenomenon and on introducing new ideas, like the concept of randomness of nonresponse, in order to try to reduce nonresponse (in its two forms: non-contacts and refusals) and/or to correct its consequences (e.g., Groves and Couper, 1998). Social scientists and survey leaders directed their attention to understanding and reducing nonresponse (e.g., Groves et al., 1999), while from a statistical perspective, research focused on trying to minimize its impact using techniques such as multiple imputation (e.g., Schafer, 1997; King et al., 2001) and adjustment and weighting methods (e.g., Isaki et al., 2004).

Nevertheless, once i) the inefficiency of the costly methods used to attempt to increase participation was recognized (e.g., Curtin et al., 2005), ii) the lack of a stable relationship between the nonresponse rate and estimation bias was accepted in the English literature (e.g., Keeter et al., 2000; Merkle and Edelman, 2002; Groves, 2006), and iii) it was acknowledged that the statistical methods currently in use are not capable of sufficiently correcting nonresponse bias³, the main challenges that, according to Groves et al. (2002), survey methodology faced at the beginning of this century were to i) determine under what circumstances nonresponse can damage population inferences and ii) identify the methods that, in the presence of nonresponse, can improve the quality of sampling estimates. This paper aims to provide some answers, within the context of election forecasting, to the second of the challenges identified in Groves et al. (2002).

Within surveys, election polls and election predictions play an important role as, unlike most surveys, they may be judged against an external standard of comparison: the actual election outcomes. The sociological and public opinion forming aspects of election forecasts therefore help to shape the image of the whole sector (Martin et al., 2005). Despite this, an analysis of the methods currently employed to generate election predictions reveals that the electoral information available outside the sample is used inefficiently. Indeed, the data recorded in previous elections could be more intensely exploited using superpopulation models (e.g., Valliant et al., 2000).⁴

In this line of research, when working with biased samples of counted votes, the election forecasting models based on the congruence that electoral outcomes of consecutive elections display at polling station or electoral section levels have shown great capacity to improve the accuracy of predictions (e.g., Bernardo and Girón, 1992; Bernardo, 1997; Pavía-Miralles, 2005; Pavía et al., 2008) even if only partial results from the polled stations are available (Pavía-Miralles and Larraz-Iribas, 2008). Furthermore, this approach has very recently revealed its potential with data from exit polls (Pavía, 2010). The aim of this paper is to study and analyze the prediction capacity of this strategy when working with election polling data and to compare it to the procedures currently in use in the polling industry, more specifically, to post-stratification methods and ratio estimators (e.g., Mitofsky and Murray, 2002; Mitofsky, 2003).

In particular, taking the 2716 and 2720 CIS post-election surveys⁵ as a reference (corresponding, respectively, to the 2007 Madrid regional election and the 2007 Barcelona local election)⁶, a large number of samples from the related populations have been simulated under three different scenarios of respondent behavior during interviews and two different strategies of sampling design. Each sample has been analyzed using four alternative estimators and the accuracy of all estimates has been assessed in comparison to the actual outcomes. The results show that introducing the outcomes recorded previously in all the voting districts into the estimation process would significantly improve the accuracy of predictions.

¹ The research reported in this paper is based on the study "Un Análisis para la Mejora de la Calidad Predictiva en las Estimaciones Electorales a partir de Datos de Encuesta", funded by the Centro de Investigaciones Sociológicas through a Sociological Research Grant awarded in 2009. The authors wish to thank Alberto Penadés and Valentín Martínez for their first-rate assistance, two anonymous referees and the REIS Editorial Board for their valuable comments and suggestions, and Tony Little for revising the English of the manuscript. Any remaining errors are the authors' sole responsibility. The authors acknowledge the support of the Spanish Ministry of Science and Innovation (MICINN) through the project CSO2009-11246/CPOL.

² In general, two types of nonresponse can be found in surveys: total or partial. Total nonresponse (or non-participation) occurs either when an individual cannot be reached to be interviewed (a group whose distribution among the various political options is often assumed to be basically random) or when, after being contacted, s/he declines to be interviewed. This second type of nonresponse often shows a skewed distribution among the various options and is the main source of nonresponse bias. Partial nonresponse (or item nonresponse) appears when the subject agrees to be interviewed but provides no answer to certain questions.

³ This was painfully clear in the exit-poll predictions made in the 2000 U.S. Presidential Election, when errors, mainly attributed to nonresponse bias (e.g., Konner, 2003; Biemer et al., 2003; Randall, 2008), provoked a great stir.

⁴ In a superpopulation framework, the target population is considered a realization of a larger underlying population (superpopulation), where the individual realizations of each member of the population show certain patterns of regularity that can be statistically exploited and, therefore, be used to improve the quality of forecasts. From this perspective, if the same election could be repeated indefinitely under similar conditions (emulating in electoral terms the movie Groundhog Day) the outcomes obtained with each new election would be different, but with some regularities that would be recognized as belonging to the same underlying population. This idea can be extended dynamically to find relationships between voting outcomes recorded at different points in time.

⁵ The reason for considering post-election surveys as the basis for analysis rather than pre-election polls is the need to assess both the forecasting alternatives and the sampling designs with the least possible noise and, above all, to be able to use an objective, external criterion of validity. The responses collected in post-election surveys tally with accomplished facts, so theoretically the estimates derived from them could be directly compared to the results actually recorded in the election. However, the answers reflected by pre-election polls are subject to statements of intent, which may vary between the date of the survey and the polling day. Consequently, some of the possible deviations observed between the values registered in the elections and predictions could be due to changes in the state of opinion between the time of the survey and the polling day, rather than to technical issues. This would undoubtedly greatly hamper the assessment.

⁶ Although national elections generally arouse greater interest among the public, in this study we have worked with a regional and a local poll for logistical and technical reasons. Firstly, working with a national legislative election involves dealing with more than 35,000 electoral sections (census tracts), which would have entailed enormous costs in information management. Secondly, due to the low number of representatives usually apportioned in most of the constituencies in a general election, national predictions are less sensitive to estimation biases and are therefore initially not as interesting in technical terms. Thirdly, from a theoretical point of view, the difficulty of producing accurate estimates grows inversely with population size and directly with the number of parties. That is, the larger the population, the less difficult (in relative terms) it is to generate precise estimates and the greater the number of candidates, the more difficult it is to get it right. The 2007 Madrid regional election was chosen as an example of a situation with a low effective number of parties, but where estimates are highly sensitive [in 2007, there were 111 representatives in the Madrid Assembly]. On the other hand, the 2007 Barcelona local elections were chosen to assess the model under theoretically more unfavorable circumstances and in a different socio-political environment. From a technical standpoint, the electorate is significantly smaller and the effective number of parties clearly higher. Politically speaking, the Catalonian electorate is more fragmented and in 2007 in Barcelona the majority party was the PSC, as opposed to Madrid where it was the PP.

The rest of the paper is organized as follows. The second section describes the characteristics of the target populations and the criteria followed to generate the samples. The third section describes the estimators used and the fourth analyzes and compares the forecasts. In the fifth section, the estimators are applied to the actual data collected in the 2716 and 2720 surveys. Finally, the last section summarizes the conclusions drawn and critically discusses their implications. Two appendices complete the document. The first describes how simulated vote recall has been generated, while the second provides full details of how samples and respondents' responses were simulated.

POPULATIONS, SAMPLING DESIGNS AND RESPONSE SIMULATION

According to the methodological notes accompanying the opinion polls conducted by the Centro de Investigaciones Sociológicas (CIS), the mechanism followed by this institution to select sample units is defined as a stratified multi-stage cluster procedure "con selección de las unidades primarias de muestreo (municipios) y de las unidades secundarias (secciones) de forma aleatoria proporcional, y de las unidades últimas (individuos) por rutas aleatorias y cuotas de sexo y edad" [with selection of primary sampling units (municipalities) and secondary units (sections) with probability proportional to size, and of the last units (individuals) by random routes and sex and age quotas].

In this research, the CIS sampling procedure as described in Rodríguez Osuna (2005) has been taken as a basis and two different strategies of sampling design have been tested. On the one hand, we have followed a similar approach to that used by the CIS (hereafter CIS; see Rodríguez Osuna, 1991 and 2005) and, on the other hand, we have simulated surveys using a significantly easier and cheaper sampling design (hereafter AL). In the case of the second strategy, a smaller number of sections have been randomly selected⁷ than in the CIS samples and a relatively larger number of citizens have been surveyed in each selected section. The reason for this is to study the impact that using a theoretically cheaper sample design (because it entails lower travel expenses and fewer pollsters) would have on the quality of the estimates.

Cluster sampling (with electoral sections as clusters) is therefore the basis of the sampling technique used in the CIS and AL surveys analyzed in this paper⁸. Having vote distributions at census tract level (i.e., the votes obtained by each party in each electoral section) is therefore essential to carry out the planned simulation. However, more data than those corresponding to the elections we intend to forecast are required in order to obtain the predictions. It is also necessary i) to have the results from the previous election in each section and ii) to associate a vote recall to each elector.⁹

⁷ Through random sampling without replacement, with probability of selection proportional to the size of the section and without prior stratification of the population.
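The section draw used in the AL design (sampling without replacement, with selection probability proportional to section size) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `draw_sections`, the seed and the section sizes are hypothetical, and drawing one section at a time from the sections still available is only one of several ways to carry out a size-proportional draw without replacement.

```python
import random

def draw_sections(sizes, k, rng):
    """Draw k distinct section indices without replacement.

    Each draw picks one of the sections not yet selected with
    probability proportional to its size (successive sampling).
    """
    remaining = list(range(len(sizes)))
    chosen = []
    for _ in range(k):
        weights = [sizes[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

rng = random.Random(2007)                              # arbitrary fixed seed
sizes = [rng.randint(400, 2000) for _ in range(1482)]  # made-up electorate sizes
al_sections = draw_sections(sizes, 25, rng)            # 25 of 1,482 sections
print(len(al_sections), "sections drawn")
```

Larger sections are more likely to be picked at every draw, and no section can be selected twice.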

Both sets of results (the votes recorded in each census tract in previous and current elections) are required to perform the analysis: past vote recall is fundamental both in the ratio estimator and post-stratification techniques currently in use and in the methodology based on superpopulation models proposed as an alternative in this paper.

⁸ This comment could lead to the erroneous conclusion that the methodology proposed in this paper could not be used in telephone surveys. Nothing is further from the truth. The CIS and AL sampling designs proposed could be perfectly suited to telephone polling, although we do accept the coverage problems that telephone surveys entail. Nonetheless, coverage bias would not pose any additional concern in this framework as this could be regarded as nonresponse.

⁹ In order to simplify the analysis, we decided to confine our study to the results recorded in the elections of the same type held previously in the constituency.

However, this requirement poses two preliminary problems that must be solved. On the one hand, it is necessary to establish the map of relations between the voting sections of current and previous elections and, on the other hand, it is necessary to assign a vote recall to each potential respondent. To solve the first problem, we have employed the classical solution¹⁰, which consists of comparing the administrative codes and number of voters assigned to each section as a mechanism that establishes the chart of matches between old and new polling units by applying a set of sound rules (Pavía-Miralles, 2005: 1117-8).¹¹

¹⁰ Currently, however, with the increasing availability of geographic information, the task of establishing correspondences could easily be automated (and improved by exploiting the spatial correlation present in election outcomes) and even extended to situations where manual assignment is impracticable (Pavía and López-Quílez, 2012).

¹¹ In particular, the basic list of rules used to track and establish the relationships between census tracts of successive elections could be summarized as follows: (i) a direct match is established between polling units that have apparently not changed (under the assumption that the relatively small number of entries and exits in their voter lists are random); (ii) when two (or more) sections are combined to create one (or more) new section(s), the aggregate outcomes of the original sections are considered as historical data for the emerging section(s); (iii) for those new sections which stem from the division of a previously existing section, the vote proportions of the original section are assigned as their past vote proportions; and (iv) either neighbourhood, city or even constituency average vote proportions are assigned as historical data for newly (or practically newly) created polling sections, because they are usually located in the expansion areas of the cities.

Once the map of relationships among sections has been established, the marginal distributions of votes (the votes recorded in previous and current elections) are available in each section. However, we also need vote recalls (or cross distributions) for all electors to apply the proposed estimators. The vote, however, is secret, so it is not possible to know present or past voting behavior at individual level. Nevertheless, in order to assign a vote recall to each subject, one could try to exploit the aggregate information available on the voting behavior of electors to infer the electoral behavior of each type of voter in each section. This issue, known as the ecological inference problem (King, 1997), is solved in this work by exploiting the valuable information on vote transfers that the 2716 and 2720 CIS electoral polls provide, using techniques of matrix balancing¹² in a two-stage process (Pavía et al., 2009). The basic idea is to estimate the voting transfer matrix among electoral options at census tract level so that, depending on the group (electoral option) to which a subject belongs, s/he can be assigned a current vote and a distribution of vote recall conditioned on that current vote.
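To make the matrix-balancing idea concrete, the sketch below applies plain RAS (alternating row and column rescaling) to a single hypothetical section. It is only an illustration of the general technique: the two-stage process of Pavía et al. (2009) is not reproduced, and the function name `ras`, the seed transfer matrix and the section totals are all invented for the example. Rows stand for the option chosen in the previous election and columns for the option chosen in the current one.

```python
def ras(seed, row_totals, col_totals, max_iter=500, tol=1e-9):
    """Rescale `seed` until its rows and columns add up to the given totals."""
    m = [row[:] for row in seed]
    for _ in range(max_iter):
        # Row step: scale every row to its target total.
        for i, target in enumerate(row_totals):
            s = sum(m[i])
            if s > 0:
                m[i] = [x * target / s for x in m[i]]
        # Column step: scale every column to its target total.
        for j, target in enumerate(col_totals):
            s = sum(row[j] for row in m)
            if s > 0:
                for row in m:
                    row[j] *= target / s
        # Columns are now exact; stop once the rows are close enough too.
        if max(abs(sum(row) - t) for row, t in zip(m, row_totals)) < tol:
            break
    return m

# Hypothetical survey-based transfer pattern (previous vote -> current vote).
seed = [[0.80, 0.15, 0.05],
        [0.10, 0.80, 0.10],
        [0.20, 0.30, 0.50]]
previous = [300, 200, 100]   # made-up section results, previous election
current = [280, 220, 100]    # made-up section results, current election
transfers = ras(seed, previous, current)
```

Each cell of `transfers` is an estimated number of electors in the section who moved from the row option to the column option, consistent with both sets of recorded totals while staying close to the pattern in the seed.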

After applying the above process (the technical details of which are shown in an example in Appendix I), we have the information about the current and previous electoral behavior of each elector in each section. These are the populations that have been used to simulate the samples.¹³

From each of the above populations, 6,000 samples of 1,000 electors (the same sizes projected in the 2716 and 2720 surveys taken as references) were extracted: a thousand samples for each of the six scenarios obtained by combining the two proposed sampling designs (CIS and AL) with the three hypotheses considered for voters' behavior when interviewed (without error, WE; with nonresponse bias, NRB; and with nonresponse bias and response error, RE; see Appendix II).

¹² In particular, the RAS method has been used. This method is a mechanical procedure, quite respectful of the initial entries of the matrix, which has been widely used in political science to infer individual behavior from aggregate values (e.g., Johnston and Pattie, 2000; Gschwend et al., 2003) and which receives theoretical backing from information theory and entropy.

¹³ It should be noted that the marginal distributions of votes used in each section coincide with the real ones and, although the cross-distributions are unknown, the estimates achieved should be close to reality. In any case, all the prediction strategies analyzed compete to see which generates forecasts closer to actual results over the same populations.

A three-step procedure was followed to simulate the samples. Census tracts were selected in the first stage. In the second stage, electors were drawn¹⁴ in each selected census tract. And, in the third stage, the current and past vote responses of the individuals selected at the second stage were collected. The details about how the samples and the responses of individuals were generated are given in Appendix II.

In addition to the detailed information provided in Appendix II, we illustrate how the samples were generated using as an example the steps followed to draw electors and generate respondents' answers in an AL sample with nonresponse bias and response error for the Barcelona local election. First, we randomly selected 25 sections (with selection probability proportional to the size of the section) from the 1,482 electoral sections of the Barcelona electorate and then proceeded, in each selected section, as follows. An elector is chosen randomly from all the electors in the section and his/her vote is observed in order to simulate nonresponse bias by tossing a coin whose probabilities of heads and tails are not necessarily equal. Let us say that s/he votes for CiU. At this point, a coin with approximately 20% probability of heads (the figure assigned to CiU voters)¹⁵ is thrown. If it is heads, the voter is discarded and another elector is drawn from the citizens that have not yet been selected. If it is tails, the voter becomes part of the sample. This process is repeated until 40 electors from the section are chosen. Then, the 40 selected voters are subjected to another coin trial to simulate the response error. More specifically, for each elector a new coin (with a 5% probability of heads) is tossed. If tails comes up, his/her recall and current vote are directly recorded, whereas if it is heads, his/her present and past votes are generated randomly.

¹⁴ In all the simulated samples only resident electors have been considered, as non-residents clearly cannot be interviewed. Obviously, therefore, the votes of non-resident electors have not been taken into account for adjustments or comparisons either.

¹⁵ The probability of nonresponse of each political option was determined from the information contained in the surveys that served as a reference, in this case the survey 2720. In the case of CiU voters, the probability of nonresponse was established between 15 and 25 percent.
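The two coin trials described for a single section can be sketched as follows. This is a hedged illustration, not the procedure of Appendix II: the function name `interview_section`, the section composition and the refusal probabilities for PSC and PP are hypothetical, and a uniform draw over the options is assumed for the "generated randomly" answers because the text quoted here does not specify their distribution. The error coin is tossed as each elector is accepted, which is equivalent to tossing it once the 40 electors have been chosen.

```python
import random

def interview_section(electors, refusal_prob, n_respondents, error_prob, options, rng):
    """Simulate the interviews in one section.

    electors: list of (past_vote, current_vote) pairs, one per elector.
    refusal_prob: probability of nonresponse for each current vote.
    """
    pool = electors[:]
    rng.shuffle(pool)  # walking a shuffled list = drawing without replacement
    answers = []
    for past, current in pool:
        if len(answers) == n_respondents:
            break
        # First coin: heads means the elector is discarded (nonresponse).
        if rng.random() < refusal_prob[current]:
            continue
        # Second coin: heads means both answers are replaced by random ones
        # (uniform over the options: an assumption made for this sketch).
        if rng.random() < error_prob:
            past, current = rng.choice(options), rng.choice(options)
        answers.append((past, current))
    return answers

rng = random.Random(2720)                              # arbitrary fixed seed
options = ["CiU", "PSC", "PP"]
refusal = {"CiU": 0.20, "PSC": 0.10, "PP": 0.35}       # only the CiU figure echoes the text
section = [(rng.choice(options), rng.choice(options)) for _ in range(900)]
answers = interview_section(section, refusal, 40, 0.05, options, rng)
```

Because refusal depends on the current vote, options with a higher refusal probability end up under-represented among the 40 answers, which is the nonresponse bias the estimators then have to correct.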

ESTIMATORS

For each of the simulated samples, four alternative estimates were obtained. First, and as a reference for comparison, direct estimates (hereafter DIR) were obtained by converting the direct raw answers of respondents into percentages. This estimator is employed to ascertain the extent of the bias of each particular sample and to assess the improvement that results from incorporating vote recall into the prediction process. In addition to this simple estimator, another three estimators were used. They all seek to make more efficient use of the out-of-sample information available and use vote recall as auxiliary information to reduce (sampling and non-sampling) forecasting bias. More specifically, they are: (i) a post-stratification estimator (hereafter PS) with individual-level correction (Mitofsky and Murray, 2002) in the variant commonly used in Spain; (ii) a ratio-weighted estimator (hereafter RAT), with correction at constituency level¹⁶; and (iii) the estimator proposed as an alternative in this paper (hereafter HD) in the version presented in Pavía (2010), with corrections at section level. In all cases, estimates were constructed with the goal of predicting the percentage of valid votes that each of the main candidatures competing in the elections would achieve. The details for calculating these estimates are given below.

¹⁶ Ratio-type estimators have a high reputation and are recommended in many books on sampling (see, e.g., Särndal et al., 2003). In fact, an estimator of this class was used during the 2000 US Presidential Elections to produce the exit-poll predictions (Mitofsky, 2003).

The direct estimator of $p_j$, the proportion of valid votes obtained by the $j$th party, is easily obtained as the ratio between the votes that party $j$ receives in the poll and the number of respondents who expressed their intention of voting. That is, the direct prediction (DIR) of $p_j$ is given by equation (1).

$$\hat{p}_{\mathrm{DIR},j} = \frac{v_j}{v}, \qquad (j = 1, 2, \ldots, r), \tag{1}$$

where $v_j$ represents the number of voters surveyed who declare they will vote for option $j$, $v$ ($= \sum_j v_j$) comprises the total number of individuals who have indicated they will vote in these elections and $r$ is the number of options (including blank votes) contesting the election.
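As a small illustration of equation (1), the direct estimate is just the share of each option among the respondents who say they will vote. The function name `direct_estimate` and the toy answers are hypothetical.

```python
from collections import Counter

def direct_estimate(current_votes):
    """Equation (1): share of each option among respondents who declare a vote."""
    counts = Counter(current_votes)   # v_j for every option j
    v = sum(counts.values())          # v, the total number of declared voters
    return {option: n / v for option, n in counts.items()}

# Toy poll: 100 respondents who declare a vote for options A, B or C.
poll = ["A"] * 45 + ["B"] * 35 + ["C"] * 20
p_dir = direct_estimate(poll)
```

No auxiliary information enters this estimate, so any nonresponse bias in the sample passes straight into the prediction.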

In order to correct for nonresponse, coverage and measurement (response) errors, many public opinion research institutes (including CIS) typically use vote recall to weight their predictions to ensure that the sample is politically representative. Post-stratification techniques¹⁷ figure prominently among these strategies. After grouping responses according to vote recall, the post-stratification estimator reweights each observation to guarantee that, by applying the weights to the responses on past votes, previous election estimates would coincide with the actual results recorded.¹⁸

In particular, as the simulated samples are self-weighting, the new weights would be given by equation (2) for electors who voted in the previous election for the $j$th party¹⁹ and by equation (3) for those who did not vote in the previous election.

$$\omega^{0}_{j} = \frac{\pi_{j,0}}{v_{j,0}/n}, \qquad (j = 1, 2, \ldots, r) \tag{2}$$

$$\omega^{0}_{r+1} = \frac{1 - \sum_{h=1}^{r} \pi_{h,0}}{(n - v_0)/n}, \tag{3}$$

where $v_{j,0}$ represents the number of voters who said they voted for option $j$ in the previous election, $v_0$ ($= \sum_j v_{j,0}$) denotes the total number of electors in the survey who voted in the preceding election, $\pi_{j,0}$ represents the proportion of votes (over census) recorded for option $j$ in the previous election and $n$ is the effective sample size.

¹⁷ In the presence of nonresponse, however, according to Kalton and Kasprzyk (1986), the commonly called post-stratification estimator should be denoted the "population weighting adjustment estimator".

¹⁸ This technique, therefore, could also be seen as a particular case of a calibration procedure, where past outcomes are used as an auxiliary (calibration) variable (e.g., Särndal, 2007).

¹⁹ To make the notation less dense, it is assumed that the same number of parties competed in both elections.

The weights obtained in (2) and (3) are used as elevation factors to obtain new estimates. Each individual is weighted according to his/her vote recall. More specifically, if $v_{i,j}$ is the number of electors in the poll who choose option $j$ in the current election (where $v_j = \sum_i v_{i,j}$) having declared to have chosen option $i$ ($i = 1, 2, \ldots, r+1$) in the previous election, the reweighted number of voters for option $j$ in the current election is obtained through equation (4), where $\omega^{0}_{i}$ denotes the weight from (2) or (3) attached to recall option $i$, and, hence, through equation (5), the PS predictions for the proportions of valid vote of each party²⁰.

$$\hat{v}_j = \sum_{i=1}^{r+1} \omega^{0}_{i}\, v_{i,j} \qquad (j = 1, 2, \ldots, r) \tag{4}$$

$$\hat{p}_{\mathrm{PS},j} = \frac{\hat{v}_j}{\sum_{h=1}^{r} \hat{v}_h}, \qquad (j = 1, 2, \ldots, r) \tag{5}$$
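The post-stratification steps in equations (2) to (5) can be traced on a toy sample as follows. This is a sketch under stated assumptions: the function name `ps_estimate` and all figures are hypothetical, the effective sample size is taken to be the number of respondents, abstention is coded as `None`, and every recall group is assumed to be non-empty so that the weights are defined.

```python
def ps_estimate(pairs, past_shares):
    """Post-stratification by vote recall, following equations (2) to (5).

    pairs: (past_vote, current_vote) per respondent; None means "did not vote".
    past_shares: share over the census of each option in the previous election.
    """
    n = len(pairs)                                   # effective sample size (assumed)
    options = list(past_shares)
    recalled = {o: sum(1 for past, _ in pairs if past == o) for o in options}
    v0 = sum(recalled.values())
    # Equation (2): weight for respondents who recall voting for option o.
    weights = {o: past_shares[o] / (recalled[o] / n) for o in options}
    # Equation (3): weight for respondents who did not vote last time.
    weights[None] = (1 - sum(past_shares.values())) / ((n - v0) / n)
    # Equation (4): reweighted number of current voters for each option.
    v_hat = {o: 0.0 for o in options}
    for past, current in pairs:
        if current is not None:
            v_hat[current] += weights[past]
    # Equation (5): proportions of valid vote.
    total = sum(v_hat.values())
    return {o: v_hat[o] / total for o in options}

# Toy sample of 100 respondents in which recall of A is over-represented.
pairs = ([("A", "A")] * 45 + [("A", "B")] * 5 + [("B", "A")] * 2 + [("B", "B")] * 18
         + [(None, "A")] * 5 + [(None, "B")] * 5 + [(None, None)] * 20)
past_shares = {"A": 0.40, "B": 0.30}                 # made-up census shares
p_ps = ps_estimate(pairs, past_shares)
```

In this toy sample half of the respondents recall voting for A although A only obtained 40% of the census, so A recallers are weighted down (0.8) and B recallers up (1.5). The direct estimate for A would be 52/80 = 0.65, whereas the reweighted estimate is 0.55.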

²⁰ The same predictions (although with different estimation error) could also have been reached by adopting a superpopulation scheme in which the individual probabilities of change among political options were modeled exclusively depending on the previous vote (Aybar, 1998).

As an alternative to PS forecasts (which are obtained taking into account the individual responses of all subjects participating in the survey grouped into strata), estimates could also be obtained using a correction at constituency level in the spirit of the ratio predictors (see, e.g., Särndal et al., 2003: 180), with the initial forecasts being adjusted using vote recall as a covariate. In this case predictions are obtained using two separate ratio estimates derived from the survey data and then applying a final correction to remove the inconsistency that the well-known lack of unbiasedness of the ratio estimator causes in the forecasts.

    In particular, if p j,0 denotes the proportionof valid votes attained by party j in the pre-vious election and p {DIR, j ,0 represents the polldirect estimate for the proportion of votesgained by party j in the previous election (cal-culated as the ratio between those respon-dents who declared in the survey they hadvoted for party j in the past call and the totalrespondents who said they voted in previouselections), the RAT predictions are obtainedby applying equations (6) and (7) recursively.First, through equation (6), an initial predic-tion for the proportion of vote for party j , p ] j , isobtained using the classical ratio estimator.Then, in a second stage, we obtain the nalRAT estimates by applying expression (7),which re-weights all the individual estimatesgiven by the p ] j , estimates to sum to one. Thissecond stage overcomes the main drawbackof the classical ratio estimator, which almostcertainly yields a set of predictions the sumof which is not unitary. 21

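    $$\tilde{p}_j = \hat{p}_{DIR,j}\,\frac{p_{j,0}}{\hat{p}_{DIR,j,0}}, \qquad j = 1, 2, \ldots, r \qquad (6)$$

    $$\hat{p}_{RAT,j} = \frac{\tilde{p}_j}{\sum_{h=1}^{r} \tilde{p}_h}, \qquad j = 1, 2, \ldots, r \qquad (7)$$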

    21 According to Särndal et al. (2003: 180), if vote recall estimates are highly skewed, the bias can be very significant.
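
    The two-stage RAT calculation in equations (6) and (7) can be sketched in a few lines of Python. This is only an illustration: the function name `rat_estimates` and all the proportions below are hypothetical and are not taken from the surveys analyzed in the paper.

```python
def rat_estimates(p_dir, p_dir_prev, p_prev):
    """Ratio (RAT) predictions following equations (6) and (7).

    p_dir      -- direct poll estimates of the current vote, one per option
    p_dir_prev -- direct poll estimates of the recalled (previous) vote
    p_prev     -- proportions actually recorded in the previous election
    """
    # Equation (6): classical ratio estimator, option by option.
    p_tilde = [d * p0 / d0 for d, d0, p0 in zip(p_dir, p_dir_prev, p_prev)]
    # Equation (7): rescale so that the predictions sum to one.
    total = sum(p_tilde)
    return [p / total for p in p_tilde]


# Hypothetical proportions for r = 4 options (illustrative values only).
p_dir = [0.46, 0.38, 0.11, 0.05]       # direct estimates, current vote
p_dir_prev = [0.42, 0.44, 0.10, 0.04]  # direct estimates, recalled vote
p_prev = [0.48, 0.39, 0.09, 0.04]      # recorded results, previous election

p_rat = rat_estimates(p_dir, p_dir_prev, p_prev)
print([round(p, 4) for p in p_rat])
```

    In this made-up example the first option is under-reported in vote recall (0.42 declared against 0.48 recorded), so its current estimate is revised upwards before the rescaling of equation (7) is applied.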

    Finally, as an alternative to the RAT and PS estimates, the HD 22 superpopulation estimator, which was the real motivation for this research, was proposed. In fact, the aim of this research was i) to study the capacity of the HD estimator to correct nonresponse bias and ii) to compare its performance against both the PS and RAT estimators in order to assess whether the theoretically more intensive use that it makes of available information is worthwhile. As well as employing the individual responses of all respondents to produce forecasts, the HD estimator also considers the census tract in which each respondent is enrolled and, moreover, exploits the historical data of all the electoral sections through a superpopulation model, rather than just their aggregation, as is the case with the PS and RAT estimators.

    This study uses the version of the HD estimator proposed by Pavía (2010), who follows a multistage procedure with different forecasting strategies for large and small parties 23. First, in each of the sections sampled, initial predictions corrected by vote recall are obtained for the proportion of votes that each of the major parties would attain. Second, exploiting the congruence that electoral results from consecutive elections display in small areas, the proportions of votes in both non-sampled and sampled sections are predicted by regressing the estimates obtained in the sampled sections in step one on the outcomes recorded in these same sections in the previous elections (Pavía-Miralles, 2005). Third, all the section forecasts obtained in step two are added to generate an estimate at constituency level for large parties [PP and PSOE were considered

    22 The abbreviation HD (coined in Pavía et al., 2008) denotes historical data. The estimator is named after the intensive use of small-area historical data on which it is based.

    23 Pavía (2010: 73) justifies taking a different approach for large and small parties on the basis of the law of large numbers.

    as the major political parties in the case of the Madrid regional elections and PSC, CiU and PP in the Barcelona local elections]. Finally, HD predictions for smaller parties are obtained by combining the forecasts obtained for the major parties with the direct estimates, $\hat{p}_{DIR,j}$, of small parties.

    In particular, adapted to the current situation, the process proposed in Pavía (2010) operates as follows:

    i) From the survey data, initial estimates for the proportions of current and past votes are obtained in each sampled electoral section $s$ ($= 1, 2, \ldots, N_s$) and for each large party $j$ ($= 1, 2, \ldots, G$):

    $$\tilde{p}_{js} = \frac{v_{js}}{v_s}, \qquad \tilde{p}_{js,0} = \frac{v_{js,0}}{v_{s,0}}, \qquad s = 1, 2, \ldots, N_s;\; j = 1, 2, \ldots, G \qquad (8)$$

    where $v_{js}$ ($v_{js,0}$) denotes the number of respondents in section $s$ who declare a vote for party $j$ in the current (previous) election, $v_s = \sum_{j=1}^{r} v_{js}$ ($v_{s,0}$) represents the total respondents in the section who declare a vote in the current (previous) election and $N_s$ is the number of sections sampled.

    ii) Using actual values recorded in the previous election in the sampled sections (and assuming that entries to and exits from section voter lists are random), vote-recall-adjusted estimates, $\bar{p}_{js}$, are obtained for the proportions of votes that each party $j$ would attain in each section $s$ by: 24

    $$\bar{p}_{js} = \tilde{p}_{js} + (p_{js,0} - \tilde{p}_{js,0}), \qquad s = 1, 2, \ldots, N_s;\; j = 1, 2, \ldots, G \qquad (9)$$

    24 Correcting nonresponse bias at section level makes the process more flexible because, as Pavía (2010: 70) points out, "[i]t allows for a different bias mechanism for each polling place and for the magnitude and even the direction of the bias to vary across locations".

    where $p_{js,0}$ denotes the proportion of votes recorded in section $s$ by party $j$ in the previous election.

    iii) A linear relationship is assumed between the proportions of actual ($p_{js}$) and previous ($p_{js,0}$) votes for each party at section level, with (for simplicity) zero-mean normal disturbances with constant correlation between parties and conditional independence between sections; i.e.:

    $$p_{js} = \alpha_j + \beta_j\, p_{js,0} + \varepsilon_{js}, \qquad s = 1, 2, \ldots, N;\; j = 1, 2, \ldots, r \qquad (10)$$

    $\varepsilon_{js}$ being zero-mean normal disturbances verifying $E(\varepsilon_{js}\,\varepsilon_{j^*s^*}) = \delta_{ss^*}\,\sigma_{jj^*}$ (where $\delta$ is Kronecker's delta function), $\alpha_j$ and $\beta_j$ unknown parameters and $N$ the number of sections in the constituency.

    iv) Using the large-party predictions obtained at section level in ii) and defining $\bar{p}_{G+1,s} = 1 - \sum_{j=1}^{G} \bar{p}_{js}$ as the estimated proportion of votes gained by the remaining options (OT) in section $s$, the parameters $\alpha_j$ and $\beta_j$ (for $j = 1, 2, \ldots, G+1$) of (10) are estimated via the iterative algorithm proposed in Pavía-Miralles (2005: 1121) 25 and, conditional on these parameter estimates, predictions for the proportion of votes gained by the major parties in each section are obtained through:

    $$\hat{p}_{js} = \hat{\alpha}_j + \hat{\beta}_j\, p_{js,0}, \qquad s = 1, 2, \ldots, N;\; j = 1, 2, \ldots, G \qquad (11)$$

    v) Once estimates for the major parties are available in all census tracts, the section forecasts are added up to reach constituency HD estimates, using equation (12) for large

    25 In order to simplify the estimation process, it is assumed that errors in measuring the dependent variable are absorbed by the error term (see Greene, 2003: 84). Pavía-Miralles and Larraz-Iribas (2008) offer an alternative algorithm to deal with this issue when the measurement errors in the dependent variable are considered explicitly.
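
    Steps i) to iv) can be illustrated with a simplified Python sketch. It should be read under several assumptions that are not in the original text: ordinary least squares, fitted party by party, is used here as a stand-in for the iterative algorithm of Pavía-Miralles (2005), so the correlation between parties is ignored; the residual option (OT) is left out; the aggregation to constituency level of step v) is not attempted; and all function names and figures are hypothetical.

```python
def section_shares(votes):
    """Equation (8): shares of declared votes within one sampled section."""
    total = sum(votes)
    return [v / total for v in votes]


def recall_adjusted(p_cur, p_rec, p_prev):
    """Equation (9): shift each current share by the error seen in vote recall."""
    return [c + (a - r) for c, r, a in zip(p_cur, p_rec, p_prev)]


def ols_line(x, y):
    """Least-squares intercept and slope of y on x (simplified step iv)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return my - beta * mx, beta


G = 2  # number of large parties; the third option groups everything else

# Hypothetical sampled sections: declared current votes, declared recalled
# votes (both over all options) and recorded previous shares of the G parties.
sampled = [
    ([12, 6, 2], [10, 8, 2], [0.55, 0.35]),
    ([8, 10, 2], [7, 11, 2], [0.42, 0.48]),
    ([9, 9, 2], [9, 9, 2], [0.47, 0.43]),
    ([14, 4, 2], [13, 5, 2], [0.66, 0.26]),
]
# Hypothetical previous shares in every section of the constituency:
# the four sampled sections plus two that were not sampled.
all_prev = [s[2] for s in sampled] + [[0.50, 0.40], [0.38, 0.52]]

# Steps i) and ii): recall-adjusted estimates in the sampled sections.
adjusted = [
    recall_adjusted(section_shares(cur)[:G], section_shares(rec)[:G], prev)
    for cur, rec, prev in sampled
]

# Steps iii) and iv): fit equation (10) per party, then predict with (11).
predictions = []  # predictions[j][s]: party j, section s
for j in range(G):
    alpha, beta = ols_line([s[2][j] for s in sampled], [a[j] for a in adjusted])
    predictions.append([alpha + beta * prev[j] for prev in all_prev])

for j, row in enumerate(predictions):
    print("party", j + 1, [round(p, 3) for p in row])
```

    Note that the regression also yields predictions for the two sections that were never sampled, which is the feature that distinguishes the HD approach from the RAT and PS corrections.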

    se closest to real conditions. A broad summary of several aspects relevant to analyzing the simulations is provided in Tables II and III and in Figure 1. Table II, classified by scenario and estimator, includes forecast means and associated biases. Table III displays the average values of the goodness-of-fit measures and a comparison of the distributions of the degree of adjustment of all the estimates, measured by the percentage of times that each estimator generates the best solution in terms of entropy. Finally, the box-and-whisker plots in Figure 1 show the distributions of the predictions obtained following the CIS sampling design.
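
    The success percentages reported in Table III can be computed along the following lines. This is a minimal sketch: the function names, the treatment of ties and the entropy values for four simulated samples are all hypothetical.

```python
def success_rates(entropy_by_estimator):
    """Share of samples (in %) in which each estimator has the lowest entropy.

    Ties are resolved in favor of the estimator listed first (an assumption).
    """
    names = list(entropy_by_estimator)
    n = len(entropy_by_estimator[names[0]])
    wins = {name: 0 for name in names}
    for i in range(n):
        best = min(names, key=lambda name: entropy_by_estimator[name][i])
        wins[best] += 1
    return {name: 100.0 * w / n for name, w in wins.items()}


def pairwise_rate(ent_a, ent_b):
    """Share of samples (in %) in which estimator A strictly beats estimator B."""
    better = sum(a < b for a, b in zip(ent_a, ent_b))
    return 100.0 * better / len(ent_a)


# Hypothetical entropy values for four simulated samples.
ent = {
    "DIR": [2.0, 1.0, 3.0, 2.5],
    "PS": [1.5, 1.2, 2.0, 2.6],
    "RAT": [1.8, 1.4, 1.9, 2.0],
    "HD": [1.6, 1.1, 1.7, 2.1],
}
print(success_rates(ent))
print(pairwise_rate(ent["PS"], ent["HD"]))
```

    With only four made-up samples each estimator wins exactly once; in the study the same counting is carried out over the 1,000 simulated samples of each scenario.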

    Of the two elements that define each scenario, the willingness of voters to participate in the survey has the greatest impact on estimate accuracy. In fact, the impact of sampling design can even be classified as minor. It is not innocuous, however. On average, the estimates achieved using the CIS design are slightly closer to actual values than those obtained with the AL strategy, although both sets of predictions show fairly comparable fit levels. The reduction in the number of sampling points (and in their spatial distribution) that the AL design involves (compared to the CIS plan) leads, on the one hand, to a small increase in the estimation bias of the percentages for major parties (see Table II) and, on the other hand (as expected), to a slight increase in the variability of predictions. Nonetheless, these changes do not apply equally to all estimators. The HD estimator is the least affected by a change in the sampling plan in terms of bias and variability 27, and improves its relative position with regard to its competitors, the PS and RAT estimators (see Table III).

    The impact on forecasts of the assumptions about the behavior of voters when polled, however, is more evident. Under ideal conditions (i.e., when samples are generated without error) all the estimators produce, as

    27 In fact, when nonresponse bias is present, the HD estimator displays even lower levels of variability.

    TABLE I. Error measures used to assess goodness-of-fit of forecasts

    | Description | Acronym | Equation (1) |
    |---|---|---|
    | Mean Square Error | MSE | $\frac{1}{r}\sum_k (p_k - \hat{p}_{x,k})^2$ |
    | Root Mean Square Error | RMSE | $\sqrt{\frac{1}{r}\sum_k (p_k - \hat{p}_{x,k})^2}$ |
    | Absolute Mean Error | AME | $\frac{1}{r}\sum_k \lvert p_k - \hat{p}_{x,k} \rvert$ |
    | Relative Mean Error | RME | $\frac{100}{r}\sum_k \frac{\lvert p_k - \hat{p}_{x,k} \rvert}{p_k}$ |
    | Entropy | ENT | $-\sum_k p_k \log\left(1 - \frac{\lvert p_k - \hat{p}_{x,k} \rvert}{100}\right)$ |

    Source: Own elaboration.
    (1) $r$ denotes the number of political options for which the joint adjustment is calculated, $p_k$ the actual percentage of votes recorded for the $k$-th option, and $\hat{p}_{x,k}$ the estimate of the percentage of votes gained by party $k$ after applying the corresponding estimator, with $x$ = DIR, RAT, PS and HD.
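
    The goodness-of-fit measures of Table I translate directly into code. The sketch below assumes natural logarithms and reads the entropy measure as $-\sum_k p_k \log(1 - \lvert p_k - \hat{p}_{x,k} \rvert / 100)$; both points are assumptions, as is the function name `fit_measures`. The input figures are the 2007 Madrid outcomes and the raw-data DIR forecast reported in Table VI.

```python
import math


def fit_measures(actual, estimate):
    """Goodness-of-fit measures of Table I for percentages of votes."""
    r = len(actual)
    errors = [p - q for p, q in zip(actual, estimate)]
    mse = sum(e ** 2 for e in errors) / r
    ame = sum(abs(e) for e in errors) / r
    rme = 100.0 / r * sum(abs(e) / p for e, p in zip(errors, actual))
    # Entropy: assumed form, with natural logarithm.
    ent = -sum(p * math.log(1 - abs(e) / 100.0) for e, p in zip(errors, actual))
    return {"ENT": ent, "MSE": mse, "RMSE": math.sqrt(mse), "AME": ame, "RME": rme}


# Percentages for PP, PSOE, IU and OT as reported in Table VI.
actual = [53.14, 33.33, 8.86, 4.67]    # 2007 election outcomes
dir_raw = [46.55, 37.67, 11.18, 4.59]  # DIR forecast, raw data (n = 807)

for name, value in fit_measures(actual, dir_raw).items():
    print(name, round(value, 2))
```

    The resulting ENT, RMSE and AME values should agree with the DIR row of Table VII to two decimals, and RME should be close to it; small discrepancies are expected because the published percentages are themselves rounded.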

    expected, highly accurate predictions, with a negligible average bias. Generally speaking, however, the PS estimator registers the best fit (supported largely by the fact that it produces the best predictions for small candidatures), followed by the HD and RAT predictors (with similar figures) and the DIR estimator, which clearly shows higher levels of error. With the CIS design, nevertheless, the DIR estimator generates the best solution slightly more often than the HD and RAT estimators (see Table III), which combined with the above statement suggests it is less robust (see Figure 1). That is, when it errs, it errs by more. In short, even in circumstances where direct estimation has no theoretical disadvantage, vote recall correction helps to improve estimate accuracy.

    Voters' behavior simulated in the ideal scenarios, however, is far from realistic. In the presence of asymmetric nonresponse, all four estimators are hit hard. DIR estimates show a significant bias and, although the use of the auxiliary information provided by vote recall substantially reduces the magnitude of the bias, it remains appreciable. Nevertheless, the HD strategy does reduce bias the

    TABLE II. Forecast and estimation error averages (1) of the percentages of votes for the main parties contesting the 2007 Madrid Assembly regional election

    | Scenario | Estimator | PP (%) | PSOE (%) | IU (%) | OT (%) | PP error (3) | PSOE error (3) | IU error (3) | OT error (3) |
    |---|---|---|---|---|---|---|---|---|---|
    | CIS_WE | DIR | 53.43 | 33.13 | 8.80 | 4.65 | 0.29 | 0.21 | 0.07 | 0.02 |
    | CIS_WE | PS | 53.12 | 33.32 | 8.95 | 4.61 | 0.02 | 0.01 | 0.09 | 0.06 |
    | CIS_WE | RAT | 52.78 | 33.35 | 9.03 | 4.84 | 0.36 | 0.01 | 0.17 | 0.17 |
    | CIS_WE | HD | 53.16 | 33.15 | 8.94 | 4.75 | 0.02 | 0.18 | 0.08 | 0.08 |
    | AL_WE | DIR | 52.86 | 33.67 | 8.84 | 4.62 | 0.28 | 0.34 | 0.02 | 0.04 |
    | AL_WE | PS | 52.89 | 33.65 | 8.85 | 4.61 | 0.25 | 0.32 | 0.01 | 0.06 |
    | AL_WE | RAT | 52.71 | 33.59 | 8.88 | 4.82 | 0.43 | 0.26 | 0.01 | 0.16 |
    | AL_WE | HD | 53.03 | 33.46 | 8.85 | 4.66 | 0.11 | 0.13 | 0.01 | 0.01 |
    | CIS_NRB | DIR | 51.13 | 34.72 | 9.30 | 4.84 | 2.01 | 1.39 | 0.44 | 0.18 |
    | CIS_NRB | PS | 51.56 | 34.51 | 8.95 | 4.97 | 1.58 | 1.18 | 0.09 | 0.31 |
    | CIS_NRB | RAT | 51.65 | 34.33 | 8.81 | 5.21 | 1.49 | 0.99 | 0.05 | 0.54 |
    | CIS_NRB | HD | 52.02 | 34.09 | 9.12 | 4.77 | 1.12 | 0.76 | 0.26 | 0.10 |
    | AL_NRB | DIR | 50.95 | 34.86 | 9.36 | 4.83 | 2.19 | 1.52 | 0.49 | 0.17 |
    | AL_NRB | PS | 51.54 | 34.61 | 8.88 | 4.97 | 1.60 | 1.28 | 0.02 | 0.31 |
    | AL_NRB | RAT | 51.75 | 34.43 | 8.67 | 5.16 | 1.39 | 1.09 | 0.20 | 0.49 |
    | AL_NRB | HD | 52.11 | 34.22 | 8.99 | 4.68 | 1.02 | 0.88 | 0.13 | 0.01 |
    | CIS_RE | DIR | 49.99 | 34.08 | 10.11 | 5.83 | 3.15 | 0.74 | 1.24 | 1.16 |
    | CIS_RE | PS | 50.78 | 34.18 | 9.33 | 5.70 | 2.35 | 0.84 | 0.47 | 1.04 |
    | CIS_RE | RAT | 51.99 | 34.54 | 8.73 | 4.74 | 1.15 | 1.20 | 0.13 | 0.08 |
    | CIS_RE | HD | 52.12 | 34.34 | 8.58 | 4.97 | 1.02 | 1.00 | 0.28 | 0.30 |
    | AL_RE | DIR | 49.70 | 34.42 | 10.13 | 5.76 | 3.44 | 1.08 | 1.27 | 1.09 |
    | AL_RE | PS | 50.69 | 34.41 | 9.27 | 5.64 | 2.45 | 1.07 | 0.41 | 0.97 |
    | AL_RE | RAT | 51.98 | 34.70 | 8.61 | 4.71 | 1.16 | 1.37 | 0.25 | 0.04 |
    | AL_RE | HD | 52.07 | 34.54 | 8.52 | 4.87 | 1.06 | 1.21 | 0.35 | 0.20 |
    | Election outcomes (2) | | 53.14 | 33.33 | 8.86 | 4.67 | | | | |

    Source: Own elaboration.
    (1) Mean values from 1,000 simulated samples.
    (2) Percentage of valid votes recorded in the resident population.
    (3) Computed as the average percentage difference between estimated and real values.

    most and records the best scores in terms of forecasting accuracy. Indeed, it could be argued that the larger the bias of the sample, the better the HD estimator works (in relative terms). In fact, while the HD predictor is as a rule the best (slightly more than) a third of the time (see Table III), we found that this figure rises to almost 60% when considering exclusively the hundred samples with the most bias. The HD estimator also wins in pairwise comparisons, reversing the trend observed with ideal samples by improving its figures against the RAT predictor and, especially, by gaining ground on the PS estimator. In fact, the relatively low variability shown by PS estimates, together with their persistent bias, means that the PS estimator has the lowest success rates in these scenarios.

    The picture is quite similar in RE scenarios, although, in comparison with NRB settings, RE samples see their average bias increase (albeit minimally) for all estimators. Although the convergence that occurs between HD and RAT estimators stands out, the order of preference (HD,

    TABLE III. Summary of goodness-of-fit measures between actual and estimated joint vote distributions. 2007 Madrid Assembly regional elections

    | Scenario | Estimator | % success (1) | PS v HD (1) | PS v RAT (1) | RAT v HD (1) | ENT (2)(3) | MSE | RMSE | AME | RME |
    |---|---|---|---|---|---|---|---|---|---|---|
    | CIS_WE | DIR | 24.0 | | | | 1.62 | 2.70 | 1.47 | 1.24 | 8.12 |
    | CIS_WE | PS | 33.0 | 65.3 | 64.7 | | 1.07 | 1.35 | 1.05 | 0.90 | 6.81 |
    | CIS_WE | RAT | 22.0 | | 35.3 | 53.7 | 1.30 | 2.08 | 1.31 | 1.12 | 9.20 |
    | CIS_WE | HD | 21.0 | 34.7 | | 46.3 | 1.36 | 2.13 | 1.32 | 1.12 | 8.33 |
    | AL_WE | DIR | 18.4 | | | | 2.38 | 5.62 | 2.03 | 1.68 | 9.31 |
    | AL_WE | PS | 35.0 | 60.7 | 58.7 | | 1.23 | 1.69 | 1.16 | 0.99 | 6.99 |
    | AL_WE | RAT | 23.0 | | 41.3 | 49.2 | 1.41 | 2.29 | 1.37 | 1.17 | 9.08 |
    | AL_WE | HD | 23.6 | 39.3 | | 50.8 | 1.41 | 2.22 | 1.35 | 1.15 | 8.20 |
    | CIS_NRB | DIR | 25.1 | | | | 2.05 | 4.01 | 1.76 | 1.46 | 8.37 |
    | CIS_NRB | PS | 18.8 | 43.8 | 51.7 | | 1.50 | 2.20 | 1.33 | 1.12 | 7.20 |
    | CIS_NRB | RAT | 21.2 | | 48.3 | 39.2 | 1.50 | 2.48 | 1.41 | 1.20 | 9.01 |
    | CIS_NRB | HD | 34.9 | 56.2 | | 60.8 | 1.38 | 2.06 | 1.28 | 1.08 | 7.41 |
    | AL_NRB | DIR | 22.5 | | | | 2.68 | 6.73 | 2.22 | 1.83 | 9.72 |
    | AL_NRB | PS | 22.1 | 42.6 | 50.4 | | 1.62 | 2.50 | 1.42 | 1.20 | 7.56 |
    | AL_NRB | RAT | 24.8 | | 49.6 | 39.3 | 1.59 | 2.74 | 1.49 | 1.27 | 9.35 |
    | AL_NRB | HD | 30.6 | 57.4 | | 60.7 | 1.44 | 2.20 | 1.33 | 1.13 | 7.79 |
    | CIS_RE | DIR | 16.4 | | | | 2.51 | 5.74 | 2.21 | 1.87 | 13.12 |
    | CIS_RE | PS | 12.1 | 30.2 | 29.5 | | 1.87 | 3.19 | 1.66 | 1.40 | 10.25 |
    | CIS_RE | RAT | 34.0 | | 70.5 | 49.1 | 1.48 | 2.35 | 1.38 | 1.17 | 8.16 |
    | CIS_RE | HD | 37.5 | 69.8 | | 50.9 | 1.46 | 2.31 | 1.36 | 1.15 | 7.91 |
    | AL_RE | DIR | 15.1 | | | | 3.08 | 8.68 | 2.62 | 2.20 | 13.93 |
    | AL_RE | PS | 13.0 | 27.4 | 31.3 | | 2.01 | 3.66 | 1.75 | 1.49 | 10.32 |
    | AL_RE | RAT | 35.4 | | 68.7 | 45.0 | 1.56 | 2.62 | 1.45 | 1.23 | 8.36 |
    | AL_RE | HD | 36.5 | 72.6 | | 55.0 | 1.51 | 2.47 | 1.41 | 1.20 | 8.11 |

    Source: Own elaboration.
    (1) Percentage of samples, over 1,000 simulations, for which the corresponding estimator achieves a better fit in terms of entropy.
    (2) Mean values from 1,000 simulated samples.
    (3) ENT: Entropy; MSE: Mean square error; RMSE: Root of MSE; AME: Absolute mean error; RME: Relative mean error.

    RAT, PS and DIR) obtained in NRB scenarios remains valid. In terms of overall adjustment, the PS predictor once again suffers the most (see Table III). In fact, among the four estimators, it is the best in only just over 10% of cases and falls to odds of 7 to 3 against in the pairwise comparisons with the HD and RAT estimators. The post-stratification strategy encountered the most serious problems when predicting the two major parties, with the largest errors concentrated in PP estimates. In light of the results, it can be argued that, of the three estimators based on vote recall, the PS estimator corrects raw responses the least and also yields the predictions that are closest to DIR estimates.

    Nevertheless, despite the HD estimator being the best in both RE and NRB scenarios and the fact that the confidence intervals of HD predictions 28 would have included the true value in a large percentage of cases (so that, from a statistical standpoint, there would have been not errors but uncertainty), average HD results are still a significant distance from true values. Consequently,

    28 Obtained after taking the variance of the 1,000 simu-lations as an estimate of the sampling variance.

    FIGURE 1. Box and Whisker plots of all the estimates obtained using CIS sampling design for PP (top left),PSOE (top right), IU (bottom left) and others (bottom right).

    there is still room to search for more accurate estimators.
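
    The interval coverage mentioned above (with the variance of the simulated predictions standing in for the sampling variance, as in footnote 28) could be checked roughly as follows. The normal approximation, the 95% level (z = 1.96), the function name and the simulated figures are all assumptions made for illustration; the paper does not state how the intervals were built.

```python
import random
import statistics


def coverage(predictions, true_value, z=1.96):
    """Share (in %) of predictions whose interval contains the true value.

    Each interval is prediction +/- z * sd, where sd is the standard
    deviation of all simulated predictions (assumed normal approximation).
    """
    sd = statistics.stdev(predictions)
    hits = sum(abs(p - true_value) <= z * sd for p in predictions)
    return 100.0 * hits / len(predictions)


# Hypothetical example: 1,000 simulated forecasts for one party, centered
# one point below a true value of 53.14 with a standard deviation of 1.
random.seed(0)
simulated = [random.gauss(52.14, 1.0) for _ in range(1000)]
print(round(coverage(simulated, 53.14), 1))
```

    A biased estimator can still show high coverage when its variability is large relative to its bias, which is why the text distinguishes between uncertainty and the distance of the average forecast from the true value.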

    Barcelona Local Elections

    The simulation outcomes obtained for the 2007 Barcelona local elections are discussed in this subsection. Compared to the Madrid Assembly elections, the electoral scene in Barcelona has an evident multiparty structure and a considerably smaller electorate. Both factors make it more difficult to obtain accurate estimates. Tables IV and V are the counterparts of Tables II and III in the previous subsection, while Figure 2 in this case provides the estimate distributions obtained after applying the AL sampling design.

    Barcelona election simulation results reinforce the findings reached using data from the Madrid elections. Despite the situation being more complex, the electorate being smaller in size and the political landscape more fragmented, the general trends and overall results obtained previously are confirmed. As regards comparisons between CIS and AL sampling, similar conclusions to those attained with the Madrid simulations are reached again. In general, selecting a smaller number of sampling points (combined with interviewing a larger number of electors at each sampled point) hardly affects the accuracy of the estimates, with the HD estimator being the least affected (if at all) by the shift in sampling strategy. Therefore, it could be stated that financial criteria (and issues related to the estimation accuracy of other variables also surveyed) should be weighed up in order to decide, by way of a cost-benefit analysis, whether it is better to follow a CIS sampling plan or a simple sampling plan such as the AL design analyzed in this study, especially if the estimator adopted is the HD.

    TABLE IV. Forecast and estimation error averages (1) of the percentages of votes for the main parties contesting the 2007 Barcelona local elections

    | Scenario | Estimator | PSC (%) | CiU (%) | PP (%) | ICV (%) | ERC (%) | OT (%) | PSC error (3) | CiU error (3) | PP error (3) | ICV error (3) | ERC error (3) | OT error (3) |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | CIS_WE | DIR | 30.10 | 25.33 | 15.58 | 9.36 | 8.84 | 10.79 | 0.14 | 0.13 | 0.03 | 0.01 | 0.01 | 0.02 |
    | CIS_WE | PS | 30.19 | 25.40 | 15.78 | 9.17 | 8.87 | 10.59 | 0.23 | 0.05 | 0.17 | 0.17 | 0.02 | 0.20 |
    | CIS_WE | RAT | 30.05 | 25.29 | 15.68 | 9.34 | 8.74 | 10.90 | 0.09 | 0.16 | 0.07 | 0.01 | 0.11 | 0.12 |
    | CIS_WE | HD | 30.18 | 25.04 | 15.64 | 9.41 | 8.88 | 10.85 | 0.22 | 0.42 | 0.03 | 0.06 | 0.03 | 0.08 |
    | AL_WE | DIR | 29.97 | 25.30 | 15.64 | 9.49 | 8.81 | 10.79 | 0.01 | 0.16 | 0.04 | 0.15 | 0.04 | 0.00 |
    | AL_WE | PS | 30.06 | 25.47 | 15.82 | 9.24 | 8.85 | 10.56 | 0.10 | 0.01 | 0.22 | 0.11 | 0.01 | 0.21 |
    | AL_WE | RAT | 29.93 | 25.44 | 15.72 | 9.36 | 8.74 | 10.81 | 0.03 | 0.02 | 0.12 | 0.01 | 0.11 | 0.03 |
    | AL_WE | HD | 30.17 | 25.19 | 15.68 | 9.45 | 8.76 | 10.75 | 0.21 | 0.26 | 0.08 | 0.10 | 0.10 | 0.03 |
    | CIS_NRB | DIR | 33.99 | 26.46 | 9.20 | 11.46 | 9.11 | 9.78 | 4.03 | 1.01 | 6.41 | 2.12 | 0.26 | 1.01 |
    | CIS_NRB | PS | 31.27 | 25.11 | 13.01 | 11.87 | 8.63 | 10.11 | 1.31 | 0.35 | 2.59 | 2.52 | 0.22 | 0.67 |
    | CIS_NRB | RAT | 30.07 | 24.62 | 14.14 | 11.31 | 8.73 | 11.13 | 0.11 | 0.83 | 1.47 | 1.96 | 0.13 | 0.36 |
    | CIS_NRB | HD | 30.02 | 24.84 | 14.85 | 11.44 | 9.08 | 9.77 | 0.06 | 0.61 | 0.75 | 2.09 | 0.23 | 1.02 |
    | AL_NRB | DIR | 34.14 | 26.36 | 9.16 | 11.56 | 9.00 | 9.78 | 4.19 | 0.91 | 6.45 | 2.22 | 0.15 | 1.02 |
    | AL_NRB | PS | 31.38 | 25.10 | 12.93 | 11.95 | 8.55 | 10.09 | 1.42 | 0.36 | 2.68 | 2.61 | 0.30 | 0.69 |
    | AL_NRB | RAT | 30.17 | 24.66 | 14.05 | 11.38 | 8.66 | 11.08 | 0.21 | 0.79 | 1.55 | 2.03 | 0.19 | 0.29 |
    | AL_NRB | HD | 30.11 | 24.83 | 14.76 | 11.55 | 8.98 | 9.77 | 0.15 | 0.63 | 0.85 | 2.20 | 0.13 | 1.00 |
    | CIS_RE | DIR | 32.80 | 25.49 | 9.45 | 11.72 | 9.45 | 11.09 | 2.84 | 0.04 | 6.15 | 2.37 | 0.60 | 0.30 |
    | CIS_RE | PS | 30.99 | 24.69 | 12.71 | 12.00 | 8.92 | 10.69 | 1.03 | 0.77 | 2.90 | 2.66 | 0.07 | 0.09 |
    | CIS_RE | RAT | 30.80 | 24.99 | 14.58 | 11.68 | 9.19 | 8.76 | 0.84 | 0.47 | 1.02 | 2.33 | 0.34 | 2.02 |
    | CIS_RE | HD | 30.18 | 24.60 | 14.92 | 11.01 | 8.87 | 10.42 | 0.22 | 0.86 | 0.69 | 1.66 | 0.02 | 0.35 |
    | AL_RE | DIR | 32.96 | 25.55 | 9.44 | 11.62 | 9.43 | 11.00 | 3.00 | 0.10 | 6.17 | 2.27 | 0.58 | 0.22 |
    | AL_RE | PS | 31.05 | 24.75 | 12.67 | 11.96 | 8.92 | 10.65 | 1.09 | 0.71 | 2.93 | 2.61 | 0.07 | 0.13 |
    | AL_RE | RAT | 30.78 | 25.03 | 14.56 | 11.66 | 9.21 | 8.76 | 0.82 | 0.43 | 1.04 | 2.31 | 0.36 | 2.02 |
    | AL_RE | HD | 30.22 | 24.58 | 14.91 | 10.98 | 8.90 | 10.41 | 0.26 | 0.87 | 0.70 | 1.63 | 0.05 | 0.37 |
    | Election outcomes (2) | | 29.96 | 25.46 | 15.61 | 9.35 | 8.85 | 10.77 | | | | | | |

    Source: Own elaboration.
    (1) Mean values from 1,000 simulated samples.
    (2) Percentage of valid votes recorded in the resident population.
    (3) Computed as the average percentage difference between estimated and real values.

    Likewise, focusing our attention now on the impact that voters' behavior has on predictions, we found that the results are again in line with those achieved in the previous subsection. The PS predictor once again generates the best forecasts under ideal conditions. This, however, does not discredit its competitors, due, as in the case of Madrid, to

    TABLE V. Summary of goodness-of-fit measures between actual and estimated joint vote distributions. 2007 Barcelona local elections

    | Scenario | Estimator | % success (1) | PS v HD (1) | PS v RAT (1) | RAT v HD (1) | ENT (2)(3) | MSE | RMSE | AME | RME |
    |---|---|---|---|---|---|---|---|---|---|---|
    | CIS_WE | DIR | 20.3 | | | | 1.53 | 3.07 | 1.65 | 1.35 | 9.14 |
    | CIS_WE | PS | 35.3 | 72.0 | 66.0 | | 1.10 | 1.69 | 1.23 | 1.02 | 7.19 |
    | CIS_WE | RAT | 25.0 | | 34.0 | 58.0 | 1.26 | 2.39 | 1.45 | 1.19 | 8.56 |
    | CIS_WE | HD | 19.4 | 28.0 | | 42.0 | 1.34 | 2.43 | 1.47 | 1.22 | 8.49 |
    | AL_WE | DIR | 16.8 | | | | 1.80 | 4.14 | 1.89 | 1.55 | 10.20 |
    | AL_WE | PS | 35.8 | 63.9 | 63.8 | | 1.15 | 1.81 | 1.27 | 1.06 | 7.51 |
    | AL_WE | RAT | 24.0 | | 36.2 | 48.1 | 1.30 | 2.51 | 1.49 | 1.23 | 8.78 |
    | AL_WE | HD | 23.4 | 36.1 | | 51.9 | 1.29 | 2.29 | 1.43 | 1.19 | 8.43 |
    | CIS_NRB | DIR | 0.2 | | | | 3.18 | 13.03 | 3.56 | 2.77 | 17.86 |
    | CIS_NRB | PS | 17.4 | 28.0 | 28.0 | | 1.56 | 3.82 | 1.90 | 1.57 | 11.73 |
    | CIS_NRB | RAT | 38.1 | | 72.0 | 46.0 | 1.36 | 3.15 | 1.68 | 1.39 | 10.56 |
    | CIS_NRB | HD | 44.3 | 72.0 | | 54.0 | 1.30 | 2.69 | 1.57 | 1.31 | 10.04 |
    | AL_NRB | DIR | 0.3 | | | | 3.43 | 14.73 | 3.76 | 2.95 | 18.81 |
    | AL_NRB | PS | 15.2 | 22.9 | 27.9 | | 1.69 | 4.29 | 2.01 | 1.68 | 12.37 |
    | AL_NRB | RAT | 33.8 | | 72.1 | 39.0 | 1.46 | 3.49 | 1.77 | 1.48 | 11.13 |
    | AL_NRB | HD | 50.7 | 77.1 | | 61.0 | 1.34 | 2.91 | 1.63 | 1.35 | 10.42 |
    | CIS_RE | DIR | 1.4 | | | | 2.71 | 10.91 | 3.25 | 2.50 | 16.93 |
    | CIS_RE | PS | 10.9 | 19.0 | 33.0 | | 1.62 | 4.23 | 2.00 | 1.64 | 12.18 |
    | CIS_RE | RAT | 24.2 | | 67.0 | 30.0 | 1.47 | 3.70 | 1.85 | 1.54 | 12.07 |
    | CIS_RE | HD | 63.5 | 81.0 | | 70.0 | 1.25 | 2.31 | 1.45 | 1.21 | 8.91 |
    | AL_RE | DIR | 0.8 | | | | 3.00 | 12.29 | 3.43 | 2.67 | 17.54 |
    | AL_RE | PS | 13.6 | 18.3 | 35.0 | | 1.67 | 4.39 | 2.03 | 1.67 | 12.32 |
    | AL_RE | RAT | 19.9 | | 65.0 | 24.2 | 1.51 | 3.80 | 1.87 | 1.57 | 12.25 |
    | AL_RE | HD | 65.7 | 81.7 | | 75.8 | 1.25 | 2.31 | 1.44 | 1.20 | 8.86 |

    Source: Own elaboration.
    (1) Percentage of samples, over 1,000 simulations, for which the corresponding estimator achieves a better fit in terms of entropy.
    (2) Mean values from 1,000 simulated samples.
    (3) ENT: Entropy; MSE: Mean square error; RMSE: Root of MSE; AME: Absolute mean error; RME: Relative mean error.

    FIGURE 2. Box and Whisker plots of all the estimates obtained using AL sampling design for PSC (top left), CiU(top right), PP (middle left), ICV (middle right), ERC (bottom left) and others (bottom right)

    the predictions obtained using these predictors also being highly accurate (see Tables IV and V) 29. On the other hand, the HD estimator emerges again as the best when the samples have been simulated with nonresponse bias and response error. In this case, moreover, its comparative advantage over the RAT estimator increases substantially, while its superiority over the PS estimator is already huge. Most likely, the growth in the relative advantage of the HD estimator is due to the fact that, in this case, we must tackle significantly more biased samples, an issue that, as discussed previously, seems to favor HD estimates. Indeed, as can be observed in Table IV, this seems to be the cause: in NRB and RE scenarios, DIR predictions display huge bias, even exceeding six points on average in the case of the PP.

    As regards the parties themselves, it is worth highlighting the forecasts obtained for ICV and CiU, on the negative side, and the estimates achieved for PSC and PP, on the positive side. ICV accumulates the highest prediction errors. None of the estimators used seems to have been able to significantly reduce the bias present in the raw data for this party (see Figure 2). The case of CiU is somewhat different. CiU tends to display positive bias in samples simulated with nonresponse bias. Although this appears to be detected by all the strategies, the fact is that all of them over-correct it, resulting on average in significantly negatively biased predictions. In both cases, in order to gain insight into the causes of the situations described above, it would be interesting to study the spatial distribution of their support and to analyze whether there were any specific anomalies in vote transfers among parties between the 2003 and 2007 elections. At the other end of the scale, it is worth highlighting the predictions

    29 In fact, RAT estimates work even better on this occasion than PS estimates in terms of bias.

    achieved for PSC and PP. In this case, despite the raw data being spectacularly biased, all estimators are observed to have made important corrections, which in the case of the HD predictor led to fairly accurate forecasts.

    Finally, it should be noted that despite the superiority of the HD estimates being obvious in the most realistic circumstances and the fact that their average forecasts are also now relatively closer to actual outcomes, the observations made in the previous paragraph confirm that there is still room for improvement and researchers should search for more accurate strategies. In any case, it seems evident that in the presence of asymmetric nonresponse, the HD strategy clearly outperforms the RAT and PS strategies, which are the ones currently in use in the electoral polling industry.

    ESTIMATING WITH REAL DATA: 2716 AND 2720 CIS POLLS

    The conclusions of the simulation exercise presented in the previous section show that, in the presence of nonresponse bias, the HD estimator generates the best estimates for vote distribution more frequently. This section provides the predictions that would be obtained if the four estimators discussed in this research were applied to the actual data collected in the surveys used as a reference in this study.

    So far we have considered only one type of nonresponse: total. In real surveys, however, it is very common to observe another type of nonresponse as well: partial; that is, to find individuals who have answered only some of the questions posed. In these situations, analysts must decide whether to: i) exclude from the analysis the individuals who do not respond to all the relevant variables; or ii) adopt a theoretically more efficient approach and use all the available sample information to predict (impute) the missing values of the relevant variables.

    In our case, the relevant responses for forecasting are the answers to the questions on current vote and vote recall. The 2716 and 2720 CIS surveys, which were intended to have a sample size of 1,000, effectively had 969 and 974 observations, respectively. However, only 807 and 728 people surveyed, respectively, gave a valid response to the two items required to apply vote-recall-based strategies. A few more respondents provided at least their current vote: 881 and 826, respectively. In light of these data, we have considered three different scenarios in which to use the data collected in the surveys to yield predictions. In the first, only the raw data corresponding to individuals who report their current and previous vote are used. This approach, however, is not theoretically efficient, as it makes no use of much of the information still available in the survey. Thus, we have considered two additional scenarios in which the relevant responses that are missing (Don't know/No answer) were estimated using imputation techniques: a second scenario of simple imputation, in which we have imputed a previous vote to those respondents who gave their current vote but did not report vote recall, and a third setting of double imputation, in which we have attempted to impute a current and previous vote for respondents who did not answer one or both questions.

    Consequently, the problem now is choosing an imputation method from those that

    TABLE VI. Forecasts and estimation errors of the percentages of votes predicted using 2716 CIS poll data for the main parties contesting the 2007 Madrid Assembly elections

    | Scenario (1) | Estimator | PP (%) | PSOE (%) | IU (%) | OT (%) | PP error (3) | PSOE error (3) | IU error (3) | OT error (3) |
    |---|---|---|---|---|---|---|---|---|---|
    | Oct-2003 election outcomes (2) | | 48.47 | 39.04 | 8.51 | 3.98 | | | | |
    | Raw data (n = 807) | DIR | 46.55 | 37.67 | 11.18 | 4.59 | 6.58 | 4.34 | 2.32 | 0.07 |
    | Raw data (n = 807) | PS | 48.17 | 35.47 | 11.12 | 5.24 | 4.97 | 2.14 | 2.26 | 0.57 |
    | Raw data (n = 807) | RAT | 48.92 | 33.72 | 11.23 | 6.13 | 4.22 | 0.39 | 2.37 | 1.46 |
    | Raw data (n = 807) | HD | 50.23 | 32.93 | 11.94 | 4.91 | 2.91 | 0.41 | 3.08 | 0.24 |
    | Simple imputation (n = 881) | DIR | 47.65 | 36.42 | 10.67 | 5.26 | 5.49 | 3.08 | 1.81 | 0.60 |
    | Simple imputation (n = 881) | PS | 48.93 | 35.10 | 10.51 | 5.47 | 4.21 | 1.76 | 1.65 | 0.80 |
    | Simple imputation (n = 881) | RAT | 49.80 | 33.87 | 10.85 | 5.48 | 3.34 | 0.53 | 1.99 | 0.82 |
    | Simple imputation (n = 881) | HD | 50.68 | 32.91 | 10.99 | 5.42 | 2.45 | 0.42 | 2.12 | 0.75 |
    | Double imputation (n = 929) | DIR | 48.51 | 35.41 | 10.81 | 5.27 | 4.62 | 2.07 | 1.95 | 0.60 |
    | Double imputation (n = 929) | PS | 49.57 | 34.33 | 10.46 | 5.64 | 3.57 | 1.00 | 1.60 | 0.97 |
    | Double imputation (n = 929) | RAT | 50.19 | 33.38 | 10.64 | 5.78 | 2.94 | 0.05 | 1.78 | 1.11 |
    | Double imputation (n = 929) | HD | 51.12 | 32.39 | 11.08 | 5.40 | 2.01 | 0.94 | 2.22 | 0.74 |
    | 2007 election outcomes (2) | | 53.14 | 33.33 | 8.86 | 4.67 | | | | |

    Source: Own elaboration using data from 2716 CIS survey.
    (1) Raw data: without imputation, only individuals for whom current vote and vote recall are observed are used; Simple imputation: vote recall is imputed for those respondents for whom current vote is observed; Double imputation: either current vote, vote recall or both variables are imputed when unobserved; n: effective sample size used.
    (2) Percentage of valid votes recorded in the resident population.
    (3) Computed as the average percentage difference between estimated and real values.

    have been proposed in the literature: regression imputation, random imputation, mean imputation, nearest-neighbor imputation, multiple imputation, expert imputation, hot-deck or cold-deck (e.g., Schafer, 1997; Särndal and Lundström, 2005; Galvan and Medina, 2007). In this case, in order to make the solution workable and meaningful, imputation by expert judgment was chosen (Särndal and Lundström, 2005: 164-5), using as informative variables interviewees' responses to questions such as: evaluation of leaders, ideological self-identification, proximity to parties, ideological placement of parties by the respondent, respondent's vote in other elections, evaluation of electoral outcomes and assessment of policies.

    Tables VI and VIII display the results of the predictions obtained for the 2007 Madrid Assembly and the 2007 Barcelona local elections, respectively, after employing the three sets of data (without imputation, with simple imputation and with double imputation) described in the paragraph above. Tables VII and IX, on the other hand, show the goodness-of-fit statistics of the estimated distributions.

    As can easily be seen in Table VI, the raw results of the 2716 survey are extremely biased and, although imputation reduces the bias significantly, it remains considerably high (see DIR forecasts). In the case of Madrid, the predictions that best fit the actual outcomes are obtained using the HD estimator, which displays a significant advantage over the RAT estimates, which rank second, and over the PS predictions, which are far from the actual results despite clearly improving on the direct forecasts (see Table VII). The combination of imputation and vote-recall correction seems to reduce bias and improve the overall accuracy of all the forecasts. However, this improvement does not affect all strategies equally. The HD estimator improves the least, while the other predictors further improve their performance after imputation, with marked advances in their overall fit.

    In the case of Barcelona, the results are somewhat different. In fact, the data are even slightly more biased after imputation than they were before, despite the raw data already showing significant levels of bias. Furthermore, this case highlights the huge overcorrection that vote recall induces in the CiU and PP estimates, and also in the PSC estimates. The latter probably explains why, without imputation, all estimators based on vote-recall corrections perform worse this time than the direct estimator (see Table VIII).

    Out of the three estimators based on vote recall, the RAT predictor performs clearly worse than the PS and HD strategies. The differences between the last two, however,

    TABLE VII. Goodness-of-fit measures between actual and estimated joint vote distributions obtained using the 2716 CIS survey. 2007 Madrid Assembly regional elections (1), (2)

                             Raw Data                 Simple Imputation             Double Imputation
    Estimator      ENT   RMSE    AME    RME      ENT   RMSE    AME    RME      ENT   RMSE    AME    RME
    DIR           5.31   4.11   3.33  13.27     4.23   3.29   2.74  13.19     3.42   2.73   2.31  12.47
    PS            3.66   2.94   2.48  13.38     3.06   2.46   2.11  12.25     2.45   2.07   1.78  12.14
    RAT           2.70   2.53   2.11  16.78     2.20   2.00   1.67  11.96     1.81   1.81   1.47  12.41
    HD            1.99   2.13   1.66  11.65     1.69   1.68   1.44  11.51     1.63   1.61   1.48  11.87

    Source: Own elaboration.
    (1) Raw data: Without imputation, only individuals for whom current vote and vote recall are observed are used; Simple Imputation: Vote recall is imputed for those respondents for whom current vote is observed; Double Imputation: Either actual vote, vote recall or both variables are imputed when unobserved; n: effective sample size used.
    (2) ENT: Entropy; RMSE: Root of MSE; AME: Absolute mean error; RME: Relative mean error.


    TABLE VIII. Forecasts and estimation errors of the percentages of votes predicted using 2720 CIS poll data for the main parties contesting the 2007 Barcelona local elections

                                     Percentages                                Errors (3)
    Scenario (1)       Estimator     PSC    CiU     PP    ICV    ERC     OT     PSC   CiU    PP   ICV   ERC    OT

    2003 Election Outcomes (2)     33.57  21.45  16.12  12.07  12.81   3.97

    Raw data           DIR         31.41  26.99  11.21  11.38   8.66  10.36    1.45  1.54  4.40  2.03  0.19  0.43
    (n = 728)          PS          27.65  22.47  16.17  12.59   8.82  12.30    2.31  2.99  0.57  3.24  0.03  1.51
                       RAT         26.15  21.16  17.43  10.51   9.02  15.73    3.81  4.30  1.82  1.16  0.17  4.95
                       HD          27.85  22.17  17.50  11.16   9.26  11.07    2.11  3.28  1.89  2.81  0.40  0.29

    Simple Imputation  DIR         32.31  26.58   9.37  11.66   8.99  32.31    2.36  1.12  6.24  2.32  0.14  0.31
    (n = 826)          PS          28.01  22.24  15.50  12.61   8.90  12.73    1.95  3.21  0.10  3.27  0.05  1.95
                       RAT         26.03  20.77  17.81  10.39   8.95  16.04    3.93  4.68  2.21  1.04  0.10  5.26
                       HD          27.39  22.78  17.41  11.91   9.18  11.33    2.57  2.68  1.80  2.57  0.33  0.55

    Double Imputation  DIR         33.26  26.90   8.83  11.50   9.24  10.27    3.31  1.44  6.78  2.15  0.39  0.52
    (n = 888)          PS          27.36  22.06  17.00  12.65   8.96  11.97    2.59  3.40  1.39  3.30  0.11  1.19
                       RAT         24.88  20.10  19.97   9.87   8.91  16.28    5.08  5.36  4.37  0.52  0.05  5.49
                       HD          27.76  23.18  17.91  11.55   9.28  10.31    2.19  2.28  2.30  2.20  0.43  0.47

    2007 Election Outcomes (2)     29.96  25.46  15.61   9.35   8.85  10.77

    Source: Own elaboration using data from the 2720 CIS survey.
    (1) Raw data: Without imputation, only individuals for whom current vote and vote recall are observed are used; Simple Imputation: Vote recall is imputed for those respondents for whom current vote is observed; Double Imputation: Either actual vote, vote recall or both variables are imputed when unobserved; n: effective sample size used.
    (2) Percentage of valid votes recorded in the resident population.
    (3) Computed as the average percentage difference between estimated and real values.

    TABLE IX. Goodness-of-fit measures between actual and estimated joint vote distributions obtained using the 2720 CIS survey. 2007 Barcelona local elections (1), (2)

                             Raw Data                 Simple Imputation             Double Imputation
    Estimator      ENT   RMSE    AME    RME      ENT   RMSE    AME    RME      ENT   RMSE    AME    RME
    DIR           1.79   2.17   1.67  11.15     2.27   2.92   2.08  13.56     2.77   3.27   2.43  15.39
    PS            2.04   2.14   1.77  12.02     1.96   2.18   1.76  12.23     2.34   2.33   2.00  13.08
    RAT           3.24   3.22   2.70  16.92     3.46   3.44   2.87  17.79     4.32   4.16   3.48  20.52
    HD            2.12   2.12   1.80  11.57     2.08   2.00   1.75  11.15     1.91   1.85   1.65  10.64

    Source: Own elaboration.
    (1) Raw data: Without imputation, only individuals for whom current vote and vote recall are observed are used; Simple Imputation: Vote recall is imputed for those respondents for whom current vote is observed; Double Imputation: Either actual vote, vote recall or both variables are imputed when unobserved; n: effective sample size used.
    (2) ENT: Entropy; RMSE: Root of MSE; AME: Absolute mean error; RME: Relative mean error.
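
    As a worked check of how three of the indicators in Table IX can be read, the short Python sketch below recomputes AME, RMSE and RME for the DIR estimator with raw data from the error columns of Table VIII. The formulas are assumptions inferred from the indicator names (mean absolute error, root mean squared error, and mean absolute error relative to the actual share, in percent); they are not stated in this section, and the entropy measure (ENT) is left out because its definition is not given here. The function name is hypothetical.

```python
from math import sqrt

# 2007 Barcelona outcomes (PSC, CiU, PP, ICV, ERC, OT), taken from Table VIII.
actual = [29.96, 25.46, 15.61, 9.35, 8.85, 10.77]

# Errors reported in Table VIII for the DIR estimator with raw data.
errors = [1.45, 1.54, 4.40, 2.03, 0.19, 0.43]


def goodness_of_fit(errors, actual):
    """Return (AME, RMSE, RME) under the assumed definitions described above."""
    k = len(errors)
    ame = sum(errors) / k                                    # mean absolute error
    rmse = sqrt(sum(e * e for e in errors) / k)              # root mean squared error
    rme = 100 * sum(e / a for e, a in zip(errors, actual)) / k  # relative error, in %
    return ame, rmse, rme


ame, rmse, rme = goodness_of_fit(errors, actual)
print(round(ame, 2), round(rmse, 2), round(rme, 2))
```

    Under these assumed definitions the three values agree, to two decimals, with the DIR raw-data entries of Table IX (AME 1.67, RMSE 2.17, RME 11.15), which suggests that the indicators are simple summaries of the party-level errors.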


    are not conclusive. In terms of entropy, the PS estimator records the best results (see Table IX), although its advantage over the HD estimator is rather scanty and may even be questioned. Indeed, on the one hand we observe that, with double imputation, the HD estimator improves on the PS estimator in terms of entropy while, on the other hand, with raw and simple imputed data the differences between both sets of estimates (PS and HD) are minimal or even favour the PS. Choosing the best therefore depends on the indicator used to measure the goodness of fit. In fact, in the estimates without imputation, each estimator records the best fit for two of the four indicators, whereas in the other scenarios the PS estimator yields the best fit just once. Nevertheless, any alleged advantage of the PS estimator would be based in this case on the fact that its predictions for PP were more accurate.

    CONCLUSIONS

    When dealing with counted votes, superpopulation models based on the congruence that the aggregate electoral results of consecutive elections display on small scales (polling boxes, electoral sections, voting stations) have helped to significantly improve the predictions obtained with biased samples. In parallel, nonresponse bias is seen to be a growing problem in polls, where biased samples are the norm. The aim of this research is to study the predictive power of these methods in a survey environment and to assess their performance against the estimators currently used by the industry. In addition, the study also seeks to ascertain whether a sampling selection procedure (probably less costly in terms of both money and time) whereby a greater number of electors in a smaller number of census tracts are selected (AL sampling) could be implemented without impairing the quality of the estimates.

    In order to answer these questions, we have performed a complex simulation exercise for two different elections (2007 Madrid Assembly and 2007 Barcelona local elections), generating an enormous number of samples under different scenarios of voter behavior when interviewed. Furthermore, to complete the research, the strategy has also been tested using real poll data: the actual data (with and without imputation, to reduce partial nonresponse) collected in two post-election surveys (2716 and 2720) conducted by the CIS were also analyzed. Four different estimators have been used to generate predictions: a direct estimator (DIR), which translates raw poll responses into percentages, and three additional estimators that make use of vote recall responses to improve forecasts, namely the weighted ratio estimator (RAT), the post-stratification estimator (PS) and the HD superpopulation estimator, which, to produce its forecasts, regresses the vote-recall corrected estimates obtained in the sampled sections on the outcomes recorded in those same sections in the previous elections.
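
    To make the role of vote recall concrete, the following minimal Python sketch contrasts a direct estimate with a textbook post-stratification by recalled vote, applied to the illustrative 1,000-respondent table of Appendix I. It is a simplified illustration and not necessarily the exact PS estimator evaluated in this study: the function names are hypothetical, strata are defined by recalled vote only, and no section-level information is used.

```python
# Illustrative survey table from Appendix I: rows are recalled votes (A, B, C),
# columns are current votes (A, B, C); 1,000 respondents in total.
survey = [
    [400, 80, 10],
    [55, 350, 15],
    [3, 10, 77],
]

# Actual shares of A, B and C in the previous election of the same example.
previous_shares = [0.5, 0.4, 0.1]


def direct_estimate(table):
    """Raw current-vote shares: column totals divided by the sample size."""
    n = sum(sum(row) for row in table)
    return [sum(row[j] for row in table) / n for j in range(len(table[0]))]


def post_stratified_estimate(table, stratum_shares):
    """Weight the current-vote profile of each recall group by its known share."""
    estimate = [0.0] * len(table[0])
    for row, share in zip(table, stratum_shares):
        group_size = sum(row)
        for j, count in enumerate(row):
            estimate[j] += share * count / group_size
    return estimate


dir_est = direct_estimate(survey)
ps_est = post_stratified_estimate(survey, previous_shares)
print([round(100 * p, 2) for p in dir_est])
print([round(100 * p, 2) for p in ps_est])
```

    In this toy example the actual current shares are 45%, 44% and 11%. The reweighting brings the estimate for option C closer to its actual share but moves the estimates for A and B further away, so a vote-recall correction of this kind is not uniformly beneficial.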

    In light of the results, we have obtained a valuable set of findings with broad practical impact. Despite the different political and geographic areas considered in the two simulation exercises and the different hypotheses considered concerning the asymmetric response rates of voters, the conclusions reached in the simulations for both elections are very similar. In particular, it could be stated that:

    i) All estimators generate highly accurate predictions under ideal conditions (without nonresponse or response errors).

    ii) Introducing vote recall as an auxiliary variable in the forecasting process improves estimate accuracy, with the PS estimator producing the closest fit to raw data.

    iii) Of the three vote-recall based estimators, the PS estimator yields the best predictions in ideal conditions.


    iv) The PS estimator, however, suffers greatly in the presence of nonresponse bias and becomes less accurate the greater the bias.

    v) The HD estimator is clearly the best predictor when more realistic sampling circumstances are considered (when nonresponse bias and response error appear in the samples).

    vi) The RAT estimator generates (on average) quite similar solutions to the HD estimator, albeit generally less accurate.

    vii) The difference in relative accuracy between the HD estimator and the RAT and PS predictors increases when nonresponse bias grows.

    viii) Despite the obvious superiority that the HD estimator shows in more realistic situations, there still seems to be room for more accurate strategies.

    ix) In general, selecting a smaller number of sections (along with drawing a larger number of interviews per section) slightly worsens estimate accuracy, albeit unevenly for each estimator. The HD estimator suffers the least from this change in sampling strategy.

    x) As a rule, the combined use of the HD estimator and an AL sampling design would increase the quality of estimates and reduce monetary costs.

    The results obtained using the actual data collected in both the CIS 2716 and 2720 surveys point in the same direction as those achieved in the simulations. On the one hand, working with the 2716 poll data, the HD estimator clearly dominates its competitors, being able to appreciably reduce the enormous bias of the original data. On the other hand, we find that the PS and HD estimators generate quite comparable predictions with the 2720 survey data, with HD performance nevertheless still constituting an improvement on the direct estimates, which become more biased after imputation.

    In view of the findings obtained in this study, the recommendation is clear: the HD predictor should be placed ahead of the PS and RAT strategies. Such a decision would almost certainly result in an average improvement in prediction accuracy. Furthermore, the quality of the estimates would not be significantly altered if a sample design with fewer sections and more electors per section were adopted, particularly if this were accompanied by the use of the HD estimator. Therefore, taking into account that adopting such a plan would be less expensive, the only impediment that could initially discourage the industry from adopting it would be that such a change would affect accuracy when estimating other issues that are also included in electoral polls (such as leader evaluation). This potential limitation, however, could be adequately overcome by selecting sections in a completely non-random fashion. This approach would be perfectly acceptable within the HD strategy, as it does not require random selection to be applied. Adopting the HD approach would therefore also have the advantages of purposive sampling, which could, a priori, guarantee adequate representation in the survey of the whole socio-political spectrum.

    Despite the preceding statements, the HD estimator is not a panacea. On the one hand, the fact that the HD predictor on average achieves the best forecasts with realistic samples does not guarantee that it will generate better predictions than its competitors (the RAT and PS estimators) for a particular sample. On the other hand, as the HD bias figures remind us, there is still room for improvement in this context. In this regard, it would be interesting to explore whether expressly considering the geospatial dimension of the data and/or a more efficient use of sample and population information could lead to more accurate predictions.

    In addition to the issues outlined in the paragraph above, which lead to other suggestive avenues of research using alternative approaches, there are at least a couple of topics within the approach taken in this paper that should be addressed in the future: (i) deciding the sampling design that best suits the HD estimator and (ii) determining the most appropriate strategy for computing the sampling variance of the HD estimator, taking into account its features.

    REFERENCES

    Aybar, C. (1998): Modelos de Superpoblación y Poblaciones Discretas Finitas, Tesis Doctoral, Universitat de València: Valencia.

    Bernardo, José M. (1997): "Probing Public Opinion: The State of Valencia Experience", in Bayesian Case Studies 3, eds. C. Gatsonis, J.S. Hodges, R.E. Kass, R. McCulloch, P. Rossi and N.D. Singpurwalla, New York: Springer-Verlag, pp. 3-21.

    Bernardo, José M. and F. Javier Girón (1992): "Robust Sequential Prediction from Non-random Samples: the Election Night Forecasting Case", in Bayesian Statistics 4, eds. J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith, Oxford: Oxford University Press, pp. 61-77.

    Biemer, Paul, Ralph Folsom, Richard Kulka, Judith Lessler, Babu Shah and Michael Weeks (2003): "An Evaluation of Procedures and Operations Used by the Voter News Service for the 2000 Presidential Election", Public Opinion Quarterly, 67: 32-44.

    Curtin, Richard, Stanley Presser and Eleanor Singer (2005): "Changes in Telephone Survey Nonresponse over the Past Quarter Century", Public Opinion Quarterly, 69: 87-98.

    de Leeuw, Edith and Wim de Heer (2002): "Trends in Household Survey Nonresponse: A Longitudinal and International Comparison", in Survey Nonresponse, eds. R.M. Groves, D.A. Dillman, J.L. Eltinge and R.J.A. Little, New York: Wiley, pp. 41-54.

    Galván, Marco and Fernando Medina (2007): Imputación de Datos: Teoría y Práctica, Santiago de Chile: United Nations Publications.

    Greene, William H. (2003): Econometric Analysis, New Jersey: Prentice Hall.

    Groves, Robert M. (1989): Survey Errors and Survey Costs, New York: John Wiley & Sons.

    Groves, Robert M. (2006): "Nonresponse Rates and Nonresponse Bias in Household Surveys", Public Opinion Quarterly, 70: 646-675.

    Groves, Robert M. and Mick P. Couper (1998): Nonresponse in Household Interview Surveys, New York: Wiley.

    Groves, Robert M., Don A. Dillman, John L. Eltinge and Roderick J.A. Little (2002): Survey Nonresponse, New York: Wiley.

    Groves, Robert M., Eleanor Singer, Amy D. Corning and Ashley Bowers (1999): "A Laboratory Approach to Measuring the Joint Effects of Interview Length, Incentives, Differential Incentives, and Refusal Conversion on Survey Participation", Journal of Official Statistics, 15: 251-268.

    Gschwend, Thomas, Ron Johnston and Charles Pattie (2003): "Split-Ticket Patterns in Mixed-Member Proportional Election Systems: Estimates and Analyses of their Spatial Variations at the German Federal Election, 1998", British Journal of Political Science, 33: 109-128.

    Isaki, Cary T., Julie H. Tsay and Wayne A. Fuller (2004): "Weighting Sample Data Subject to Independent Controls", Survey Methodology, 30: 35-44.

    Johnston, Ron and Charles Pattie (2000): "Ecological Inference and Entropy-Maximizing: An Alternative Estimation Procedure for Split-Ticket Voting", Political Analysis, 8: 333-345.

    Kalton, Graham and Daniel Kasprzyk (1986): "The Treatment of Missing Data", Survey Methodology, 12: 1-6.

    Keeter, Scott, Carolyn Miller, Andrew Kohut, Robert M. Groves and Stanley Presser (2000): "Consequences of Reducing Nonresponse in a National Telephone Survey", Public Opinion Quarterly, 64: 125-48.

    King, Gary (1997): A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data, Princeton, NJ: Princeton University Press.

    King, Gary, James Honaker, Anne Joseph and Kenneth Scheve (2001): "Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation", American Political Science Review, 95: 49-69.

    Konner, Joan (2003): "The Case for Caution. This System Is Dangerously Flawed", Public Opinion Quarterly, 67: 5-18.

    Martin, Elizabeth A., Michael W. Traugott and Courtney Kennedy (2005): "A Review and Proposal for a New Measure of Poll Accuracy", Public Opinion Quarterly, 69: 342-369.

    Mitofsky, Warren J. (2003): "Voter News Service after the Fall", Public Opinion Quarterly, 67: 45-58.

    Mitofsky, Warren J. and Murray Edelman (2002): "Election Night Estimation", Journal of Official Statistics, 18: 165-179.

    Merkle, Daniel M. and Murray Edelman (2002): "Nonresponse in Exit Polls: A Comprehensive Analysis", in Survey Nonresponse, eds. R.M. Groves, D.A. Dillman, J.L. Eltinge and R.J.A. Little, New York: Wiley.

    Pavía, Jose M. (2010): "Improving Predictive Accuracy of Exit Polls", International Journal of Forecasting, 26: 68-81.

    Pavía, Jose M., Bernardi Cabrer and Ramón Sala (2009): "Updating Input-Output Matrices: Assessing Alternatives through Simulation", Journal of Statistical Computation and Simulation, 79: 1467-1498.

    Pavía, Jose M., Beatriz Larraz and Jose María Montero (2008): "Election Forecasts Using Spatio-Temporal Models", Journal of the American Statistical Association, 103: 1050-1059.

    Pavía, Jose M. and Antonio López-Quilez (2012): "Spatial Vote Redistribution in Redrawn Polling Units", under review.

    Pavía-Miralles, Jose Manuel (2005): "Forecasts from Non-Random Samples: The Election Night Case", Journal of the American Statistical Association, 100: 1113-1122.

    Pavía-Miralles, Jose Manuel and Beatriz Larraz-Iribas (2008): "Quick Counts from Non-Selected Polling Stations", Journal of Applied Statistics, 35: 383-405.

    Jones, Randall J., Jr. (2008): "The State of Presidential Election Forecasting: The 2004 Experience", International Journal of Forecasting, 24: 310-321.

    Rodríguez Osuna, Jacinto (1991): Métodos de Muestreo. Madrid: Centro de Investigaciones Sociológicas.

    Rodríguez Osuna, Jacinto (2005): Métodos de Muestreo. Casos Prácticos. Madrid: Centro de Investigaciones Sociológicas.

    Rosén, Bengt (1997a): "Asymptotic Theory for Order Sampling", Journal of Statistical Planning and Inference, 62: 135-158.

    Rosén, Bengt (1997b): "On Sampling with Probability Proportional to Size", Journal of Statistical Planning and Inference, 62: 159-191.

    Särndal, Carl-Erik (2007): "The Calibration Approach in Survey Theory and Practice", Survey Methodology, 33: 99-119.

    Särndal, Carl-Erik and Sixten Lundström (2005): Estimation in Surveys with Nonresponse, Chichester, England: John Wiley & Sons.

    Särndal, Carl-Erik, Bengt Swensson and Jan Wretman (2003): Model Assisted Survey Sampling, New York: Springer.

    Schafer, J.L. (1997): Analysis of Incomplete Multivariate Data, Boca Raton, Florida: Chapman & Hall/CRC.

    Singer, Eleanor (2006): "Introduction: Nonresponse Bias in Household Surveys", Public Opinion Quarterly, 70: 637-645.

    Smith, Tom W. (1999): "Developing Nonresponse Standards", paper presented to the International Conference on Survey Nonresponse, Portland, October, 1999. http://cloud9.norc.uchicago.edu/dlib/nonre.htm.

    Valliant, Richard, Alan H. Dorfman and Richard M. Royall (2000): Finite Population Sampling and Inference. A Prediction Approach, New York: John Wiley & Sons.

    RECEPTION: 27/04/2010 ACCEPTANCE: 03/06/2011


    APPENDIX

    Appendix I: Estimation of cross-voting distributions (vote recall)

    In order to estimate the cross-distribution of votes in each census section, a two-step procedure was carried out. In a first step, the voting transfer matrices obtained from the 2716 and 2720 poll data were adjusted to make them consistent with the actual results registered in each constituency. In a second step, a transfer matrix was estimated in each section by balancing the outcomes of the section to the matrix obtained for the whole district. 30

    By way of illustration, let us consider two successive elections with only three electoral choices (A, B or C) in which 1 million votes were cast (to simplify, we assume the same electors in both elections). Let us say that in the first election options A, B and C gained 500,000, 400,000 and 100,000 votes, respectively, and that in the next election they received 450,000, 440,000 and 110,000 ballots. Suppose that a survey of 1,000 voters was conducted, in which each respondent was asked for his/her vote in the current and previous elections, and that the results were classified in a contingency table as below, where the rows represent vote recall and the columns the current election votes (for example, 55 voters said they had chosen option B in the previous election and option A in the current election).

    30 From a practical-heuristic perspective, it is inefficient to work with all possible kinds of electoral behaviours. Consequently, in order to make the problem more manageable, the following were considered as possible categories to vote for: PP, PSOE, IU, Others or Blank, Abstention or No vote by age (this last option only for vote recall) in the case of the Madrid Assembly elections and, in the case of the Barcelona local elections, PSC, CiU, PP, ERC, ICV, Others or Blank, Abstention or No vote by age (the latter option only for vote recall).

                   A       B       C     Total

    A            400      80      10       490
    B             55     350      15       420
    C              3      10      77        90

    Total        458     440     102     1,000

    It is clear that this sampling transfer matrix is not completely consistent with the aggregate outcomes recorded in both elections (for example, according to the sample, party C would have obtained 90,000 votes in the first election, when in fact it gained 100,000). Nevertheless, taking the cross-distribution derived from the survey as a starting point, a voting transfer matrix among options and between elections could be approximated by imposing as constraints that row and column sums should match the actual values recorded. More specifically, using the RAS method (see, e.g., Pavía et al., 2009) at the constituency level yields the following transfer matrix:

                       A           B           C         Total

    A            399,517      89,735      10,748       500,000
    B             47,393     338,698      13,909       400,000
    C              3,090      11,567      85,343       100,000

    Total        450,000     440,000     110,000     1,000,000
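
    The balancing step can be sketched in a few lines of Python. The code below is a minimal, assumed implementation of RAS (iterative proportional fitting): it alternately rescales the rows and the columns of the survey table until both sets of totals match the recorded outcomes. The function name, tolerance and iteration cap are illustrative choices and are not taken from the paper.

```python
def ras(seed, row_targets, col_targets, tol=1e-6, max_iter=1000):
    """Rescale `seed` so that its row and column totals match the targets."""
    matrix = [[float(x) for x in row] for row in seed]
    for _ in range(max_iter):
        # Row step: scale each row to its target total (previous election).
        for i, target in enumerate(row_targets):
            total = sum(matrix[i])
            matrix[i] = [x * target / total for x in matrix[i]]
        # Column step: scale each column to its target total (current election).
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in matrix)
            for row in matrix:
                row[j] *= target / total
        # Stop when the row totals still hold after the column step.
        gap = max(abs(sum(row) - t) for row, t in zip(matrix, row_targets))
        if gap < tol:
            break
    return matrix


# Survey table of the example: rows = vote recall, columns = current vote.
survey = [[400, 80, 10], [55, 350, 15], [3, 10, 77]]
previous = [500_000, 400_000, 100_000]  # recorded totals, first election
current = [450_000, 440_000, 110_000]   # recorded totals, second election

fitted = ras(survey, previous, current)
for row in fitted:
    print([round(x) for x in row])
```

    Applied to the survey table of the example, the procedure should land very close to the transfer matrix reported above, since RAS preserves the cross-product ratios of the starting table while forcing the margins to agree with the recorded results.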

    After computing the estimated distribution of cross-voting at constituency level, the same methodology is applied at section level (taking the cross-distribution estimated for the whole district as a referen