An assessment of the age reporting in Tanzania population census 2012

The objective of this paper is to provide data users with an assessment of age reporting in the 2012 Tanzania Population and Housing Census data. Many demographic and socio-economic data are attributed by age and sex, the latter because of the biological differences between males and females. However, a variety of irregularities and misstatements are noted in age-related and sex ratio data. Given the misreporting and inconsistency of age data, despite its importance in demographic and epidemiological studies, this study assesses the quality of the age data in the 2012 Tanzania Population and Housing Census. Data were downloaded from the Tanzania National Bureau of Statistics. Age heaping and digit preference were measured using summary indices, viz. Whipple's index, Myers' blended index, and the age-sex accuracy index. The Whipple's index for both sexes combined was 154.43; males had the lower index (152.65) and females the higher (156.07). For Myers' blended index, preference was at digits "0" and "5" while avoidance was at digits "1" and "3" for both sexes. Finally, the age-sex accuracy index stood at 59.8, with a sex ratio score of 5.82 and age ratio scores of 20.89 and 21.45 for males and females respectively. The evaluation of the 2012 Population and Housing Census data using these demographic techniques rates the data as of poor quality as a result of systematic heaping and digit preference/avoidance in recorded age. Thus, innovative methods of data collection, along with measuring and minimizing errors using statistical techniques, should be used to ensure the accuracy of age data.


INTRODUCTION
Age is an important demographic variable used for descriptive and statistical analyses of population structure and for forecasting population growth. Many demographic and socio-economic data are attributed by age and sex [1]. The latter is understandable because of the biological differences between males and females. A variety of anomalies and misstatements are noted in age-related data [1,2]; age tends to be more susceptible to such anomalies despite its crucial significance in demographic and epidemiological analysis [2,3]. Likewise, [4,5] report that age data tend to be misreported more often than sex data, yet age is a very important variable in health-related studies of demography and epidemiology; it is also considered one of the constant components of community-based surveys. Thus, the misreporting/misstatement of age constitutes one of the major demographic challenges [6]. The accuracy of demographic data, especially that of age, varies from one country to another and depends on numerous factors [3,7,8]. The deficiencies occur mostly in developing countries, owing to a lack of administrative machinery and to problems in the collection and tabulation of data [9]. Other common errors arise from underreporting of children under one year of age, and from overstatement of exact age at very advanced ages to qualify for political or socio-cultural affairs [1]. Factors such as ignorance of one's true age, low numeracy skills, open hostility to some inquiries, and cultural preferences have also been reported as sources of age heaping [2,3,10].
Anomalies in demographic data are mainly of two types, viz. (i) coverage errors and (ii) content errors.
Coverage errors occur when a person is counted twice or missed during enumeration. This paper considers only the second type, content errors, which occur when a person's characteristics are incorrectly reported or tabulated during a survey, census or registration of a vital event. Leaving aside other subcategories of this type, the study focuses on respondents' voluntary or involuntary failure to give the appropriate information as required. The classic example is the misstatement/misreporting of age. To detect it, demographers have developed various techniques for ascertaining the quality of age data [1]. Age heaping summary indices (viz. Myers' blended index, Whipple's index, and the age-sex accuracy index) are among the indices developed to detect preference for or avoidance of certain end digits in reported age (i.e. they are used to measure the quality and consistency of age data).
Age heaping is a common form of misstatement or misreporting of age data. Because of it, age data frequently display excess frequencies at "round" or "attractive" ages. For example, attraction or repulsion normally occurs at even numbers or multiples of 5. Consequently, this study uses age heaping summary indices, viz. Whipple's index, Myers' blended index, and the age-sex accuracy index, to assess the quality and accuracy of the age data reported in the 2012 Population and Housing Census (PHC).

Study Methods
This paper uses age and sex data from the 2012 Population and Housing Census, made available by the National Bureau of Statistics (NBS) at http://www.nbs.go.tz.
Age heaping summary indices generally assume that individual ages are evenly distributed over specific age groups and, by extension, over the entire age spectrum [3,6]. However, each index has its own method of calculation, as presented below. For Whipple's index, the data were imported into R for analysis. The remaining indices, viz. Myers' blended index and the age-sex accuracy index, were calculated directly in a Microsoft Excel spreadsheet.
Whipple's index measures heaping on ages ending in "0" and "5" among single years of age reported between 23 and 62 years inclusive [6]. It is calculated by summing the number of persons in the age bracket 23 to 62 inclusive, and taking the ratio of the number reporting ages ending in "0" or "5" to one-fifth of the total population (or sample) in the same age range, multiplied by 100 [6]. Mathematically, with $P_a$ denoting the number of persons reporting age $a$:

\[ W = \frac{\sum_{a \in \{25, 30, \ldots, 60\}} P_a}{\frac{1}{5} \sum_{a=23}^{62} P_a} \times 100 \qquad (1) \]

In the absence of heaping or preference at "0" or "5", the index has a value of 100, while a value of 500 indicates complete heaping or preference at "0" or "5". The index is interpreted as follows: < 105 = highly accurate data; 105 to 109.9 = fairly accurate data; 110 to 124.9 = approximate data; 125 to 174.9 = rough data; and ≥ 175 = very rough data [4,12].
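The calculation can be sketched in a few lines of Python. This is an illustrative sketch only, not the R code used in the study; the function names `whipple_index` and `whipple_category` and the input layout (a mapping from single year of age to a count) are assumptions made for the example.

```python
def whipple_index(pop_by_age):
    """Whipple's index for ages 23-62 from a {single year of age: count} mapping."""
    ages = range(23, 63)  # 23 to 62 inclusive
    total = sum(pop_by_age.get(age, 0) for age in ages)
    if total == 0:
        raise ValueError("no population recorded at ages 23-62")
    # Ages ending in 0 or 5 inside the range: 25, 30, ..., 60
    heaped = sum(pop_by_age.get(age, 0) for age in ages if age % 5 == 0)
    return 100.0 * heaped / (total / 5.0)


def whipple_category(index):
    """Map an index value to the quality scale quoted in the text."""
    if index < 105:
        return "highly accurate data"
    if index < 110:
        return "fairly accurate data"
    if index < 125:
        return "approximate data"
    if index < 175:
        return "rough data"
    return "very rough data"


# Synthetic check: equal counts at every age give no heaping.
uniform = {age: 1000 for age in range(100)}
print(whipple_index(uniform))          # 100.0
# The combined-sex value reported in this paper falls in the "rough" band.
print(whipple_category(154.43))        # rough data
```

The two limiting cases described above (100 for no heaping, 500 for complete heaping) are a convenient sanity check for any implementation.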
Myers' blended index measures the preference for or avoidance of ages ending in each of the 10 terminal digits, expressed as percentages [13]. In the absence of systematic irregularities in age reporting at any of the digits "0" through "9", the blended sum at each terminal digit should be approximately equal to 10% of the total blended population [4,6]. If the blended share for a digit exceeds 10%, this indicates over-selection of ages ending in that digit (i.e. digit preference). If it is less than 10%, it indicates under-selection of ages ending in that digit (i.e. digit avoidance).
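In its standard formulation, the index is computed in the following steps:
1. Sum the populations at ages ending in each terminal digit over the whole age range, starting at the lower limit of the range (age 10 in the standard formulation).
2. Sum the same populations again, excluding the first age included for each digit in step 1 (i.e. starting ten years later).
3. Weight the sums in steps 1 and 2 and add the results to obtain a blended population (i.e. weights 1 and 9 for digit 0, weights 2 and 8 for digit 1, etc.).
4. Convert the distribution in step 3 into percentages.
5. Take the deviation of each percentage in step 4 from 10, the expected value of each percentage.
6. Derive a summary index of preference or avoidance for all terminal digits as one half of the sum of the deviations from 10%, without regard to sign.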
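The blending steps above can be sketched in Python as follows. This is a minimal sketch rather than the spreadsheet used in the study: the function name `myers_blended`, the input layout, and the age range of 10 to 89 are assumptions, since the paper does not state the range it used.

```python
def myers_blended(pop_by_age, lower=10, upper=89):
    """Return (deviations from 10% by terminal digit, summary index).

    pop_by_age maps single year of age to a count. The age range
    lower..upper is an assumption; lower must be a multiple of 10
    so that the weights (d + 1) and (9 - d) line up with digit d.
    """
    if lower % 10 != 0:
        raise ValueError("lower age limit must be a multiple of 10")
    first_sum = [0.0] * 10   # sums starting at the lower limit
    second_sum = [0.0] * 10  # same sums, starting ten years later
    for age in range(lower, upper + 1):
        count = pop_by_age.get(age, 0)
        digit = age % 10
        first_sum[digit] += count
        if age >= lower + 10:
            second_sum[digit] += count
    # Weights 1 and 9 for digit 0, 2 and 8 for digit 1, ..., 10 and 0 for digit 9
    blended = [(d + 1) * first_sum[d] + (9 - d) * second_sum[d] for d in range(10)]
    total = sum(blended)
    if total == 0:
        raise ValueError("no population recorded in the age range")
    percentages = [100.0 * b / total for b in blended]
    deviations = [p - 10.0 for p in percentages]
    index = 0.5 * sum(abs(dev) for dev in deviations)
    return deviations, index


# Synthetic illustration (not census data): extra reports at ages ending
# in 0, fewer at ages ending in 1.
sample = {age: 1000 for age in range(10, 90)}
for age in range(10, 90):
    if age % 10 == 0:
        sample[age] = 1500
    elif age % 10 == 1:
        sample[age] = 500
deviations, index = myers_blended(sample)
print([round(dev, 2) for dev in deviations], round(index, 2))
```

A positive deviation marks preference for that digit and a negative one marks avoidance. If every person reported an age ending in the same digit, the index would reach its maximum of 90.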
The Age-Sex Accuracy Index (ASAI) measures the quality of age and sex data in five-year age groups [2,11]. The index is calculated as three times the average of the sex ratio differences plus the averages of the deviations from 100 of the male and female age ratios, algebraically written as:

\[ \mathrm{ASAI} = 3 \times \mathrm{SRS} + \mathrm{ARS}_m + \mathrm{ARS}_f \qquad (2) \]

(where ASAI = age-sex accuracy index, SRS = sex ratio score, ARSm = age ratio score for males, and ARSf = age ratio score for females).
The United Nations scale for the reliability of the data is 0 to 19.9 for accurate data, 20 to 39.9 for inaccurate data, and above 40 for highly inaccurate data [12]. The two components of the ASAI in equation 2 above, viz. (i) the sex ratio score and (ii) the age ratio scores, are calculated separately as follows.

Age ratio score (ARS): the ratio of the population in a given age group to one-half of the population in the two adjacent age groups, multiplied by 100 [2,3]. The ratio is calculated separately for each male and each female age group. Mathematically, let nARSx be the age ratio for age group x to x+n, nPx be the population in the age group from age x to x+n, and nPx-n and nPx+n be the populations in the preceding and succeeding age groups respectively; then:

\[ {}_nARS_x = \frac{{}_nP_x}{\frac{1}{2}\left({}_nP_{x-n} + {}_nP_{x+n}\right)} \times 100 \]

The computed age ratio is compared with the expected value of 100. The discrepancy at each age group is a measure of net age misreporting [2,3]. Averaging the deviations of the age ratios from 100 (regardless of sign) over all the age groups gives the measure of the accuracy of an age distribution [4]. An age ratio under 100 implies either that members of the age group were selectively under-enumerated or that errors in age reporting misclassified persons who belong to the age group. A ratio of more than 100 suggests the opposite of one or both of these conditions [2].
Sex ratio score (SRS): the ratio of the male population in a given age group to the female population in that age group, multiplied by 100 [1]. Mathematically, let nSRSx be the sex ratio of age group x to x+n, and let nPx^m and nPx^f be the male and female populations in that age group; then:

\[ {}_nSRS_x = \frac{{}_nP_x^{m}}{{}_nP_x^{f}} \times 100 \]

The sex ratio score is obtained by summing the successive differences between the sex ratio of one age group and that of the next (irrespective of sign) and then taking the average of these differences [14].
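The age ratio score, sex ratio score and combined index described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' spreadsheet; the function names and the input layout (lists of counts by consecutive five-year age group, youngest first) are assumptions, and the sample counts are synthetic rather than census figures.

```python
def mean_abs(values):
    """Average of absolute values."""
    values = list(values)
    return sum(abs(v) for v in values) / len(values)


def sex_ratio_score(males, females):
    """Mean absolute successive difference of sex ratios (males per 100 females)."""
    ratios = [100.0 * m / f for m, f in zip(males, females)]
    differences = [later - earlier for earlier, later in zip(ratios, ratios[1:])]
    return mean_abs(differences)


def age_ratio_score(pop):
    """Mean absolute deviation from 100 of the age ratios.

    Ratios are defined only for interior groups, which have both
    a preceding and a succeeding age group.
    """
    ratios = [
        100.0 * pop[i] / (0.5 * (pop[i - 1] + pop[i + 1]))
        for i in range(1, len(pop) - 1)
    ]
    return mean_abs(r - 100.0 for r in ratios)


def age_sex_accuracy_index(males, females):
    """ASAI = 3 * SRS + ARS (males) + ARS (females)."""
    return (3.0 * sex_ratio_score(males, females)
            + age_ratio_score(males)
            + age_ratio_score(females))


def un_category(asai):
    """Reliability scale quoted in the text."""
    if asai < 20:
        return "accurate data"
    if asai < 40:
        return "inaccurate data"
    return "highly inaccurate data"


# Synthetic counts for three consecutive age groups (not census data).
males = [100, 120, 100]
females = [100, 100, 100]
print(age_sex_accuracy_index(males, females))  # 80.0
```

In the synthetic example the sex ratios are 100, 120 and 100, so the sex ratio score is 20; the single male age ratio is 120 (a deviation of 20) and the female age ratio is exactly 100, giving 3 × 20 + 20 + 0 = 80.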

Results
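The Whipple's index for the 2012 PHC age data is presented in Table 1. The index is 154.43 for both sexes combined, 152.65 for males and 156.07 for females. Although the data for both sexes are rough, the quality of the male data is slightly better than that of the female data. The three indices are presented separately in Tables 1, 2 and 3.

[Table: Myers' blended index. Recoverable column headings: percentage distribution (%), deviation from 10%, remarks.]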
Instead they choose a figure they think is plausible [4,12]. Their choice is not random but shows a systematic tendency to prefer attractive numbers, such as those ending in "0" and "5", even numbers or, in some societies, numbers with other specific terminal digits due to cultural beliefs [4]. Consequently, age heaping in the country indicates ignorance of one's own age or a tendency to round ages [2,3]. It should be noted, however, that the higher heaping observed among females than among males might indicate that women in Tanzania have lower literacy and numeracy skills [9,15]. The greater anomalies in the reported age data for females might also result from proxy reporting of age by males, who tend to be the heads of household in developing countries. Thus, the evaluation of the 2012 PHC data using the demographic techniques above rates the age-related information as of poor quality as a result of systematic heaping and digit preference/avoidance. This means that age awareness in the country is quite low and many people have only a vague idea of their age. Such misreporting can lead to misclassification bias and incorrect assessment of demographic rates, and can interfere with planning effective interventions [9,15]. It should also be noted that these demographic techniques are quick and inexpensive checks on the general quality of data; they provide evidence of errors in specific segments of the population but not the magnitude of those errors.
Therefore, this study recommends working with other assessment methods [12] to ensure the accuracy of age data. For example, the methods highlighted in [3,4,17], which use local calendars of events, proved remarkably useful for generating high-quality age information in elderly populations. Thus, other innovative methods of data collection that ensure the accuracy of age data in all age categories are recommended [17]. Other methods, such as age smoothing, are recommended for correcting age-related information. The most important precaution, however, is to collect data using methodologies that ensure the quality of age data.
A notable example is the use of date of birth (day, month and year) to overcome the challenges of using completed years (age at last birthday), which has proved to be a source of age heaping since individuals might round to the nearest age [2]. Alternatively, the vital registration system should be adequately improved so as to provide reliable demographic information to supplement census and survey data.

Minus signs indicate avoidance of certain terminal digits, while positive signs indicate preference for those terminal digits. The table and figure show that preference is at terminal digits "0", "2", "5" and "8", and avoidance is at terminal digits "1", "3", "4", "6", "7" and "9". They also indicate that preference is high at terminal digit "0", with larger anomalies for females than for males, while avoidance is high at terminal digit "1", where the deviation is again larger for females than for males.

Table 5 shows an average sex ratio score of approximately 5.82, and male and female age ratio scores of about 20.89 and 21.45 respectively. This gives an age-sex accuracy index of about 59.8. On the scale proposed by the United Nations, the calculated index is very high, well above the threshold of 40 for highly inaccurate data. This rates the age data as of poor quality.
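As a check on the arithmetic, the reported component scores can be substituted into the definition of the index given in the methods (three times the sex ratio score plus the male and female age ratio scores):

\[ \mathrm{ASAI} = 3 \times 5.82 + 20.89 + 21.45 = 17.46 + 42.34 = 59.80 \]

This agrees with the reported value of about 59.8.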