A PROGRAM FOR CALCULATING POVERTY MEASURES FROM GROUPED DATA
Shaohua Chen, Gaurav Datt,
and Martin Ravallion
Poverty and Human Resources Division
Policy Research Department, World Bank
OVERVIEW:
POVCAL is designed to be an easy-to-use
and reliable tool for routine poverty assessment work. It uses
sound and accurate methods for calculating poverty and inequality
measures with only a basic PC and any of the various types of grouped
distributional data typically available, often in published form.
If one has access to the relevant
household level ("unit record") data, then there are accurate
computational methods for estimating poverty and inequality measures
directly from that data, using standard econometric/statistical
packages. But one rarely has access to the data in this form. Or one
might not have easy access to the sort of computing power (typically a
mainframe) that is needed to process unit record data.
Distributional data are more typically
available in grouped form, such as income shares of deciles of households
ranked by per capita income. There are some subtle difficulties in using
such data which do not arise with unit records. There are many ways to
estimate poverty and inequality measures from such data. The commonly
used interpolation methods can be quite unreliable. The approach we have
adopted here uses parametric specifications of the underlying Lorenz
curve, from which all desired measures can then be calculated. This has
a number of advantages. The method is efficient, reliable, and accurate
(at least for the particular specifications that we use in this
program). It also facilitates certain simulations which can be of
analytical interest, such as estimating how the poverty measures will
respond to distributionally neutral
growth (interpretable as an increase in the mean of the distribution,
holding the Lorenz curve constant).
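To illustrate why a fitted Lorenz curve and the mean are enough, here is a minimal sketch. It uses a toy Lorenz curve, L(p) = p**2, chosen only because it is easy to check by hand; it is not one of the specifications used by POVCAL, and all function names are hypothetical. Income at percentile p equals the mean times the slope L'(p), so the headcount index H solves L'(H) = z/mu, and the poverty gap index is H - (mu/z)L(H). Re-running with a higher mean and the same curve simulates distributionally neutral growth.

```python
def lorenz(p):
    """Toy Lorenz curve L(p) = p**2 (illustrative; not a POVCAL specification)."""
    return p ** 2


def lorenz_slope(p):
    """First derivative L'(p) of the toy curve."""
    return 2.0 * p


def headcount(z, mu, slope=lorenz_slope, tol=1e-12):
    """Solve L'(H) = z / mu for H by bisection on [0, 1]."""
    target = z / mu
    lo, hi = 0.0, 1.0
    if slope(hi) <= target:
        return 1.0  # the poverty line is above the highest income
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slope(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def poverty_gap(z, mu):
    """Poverty gap index PG = H - (mu / z) * L(H)."""
    h = headcount(z, mu)
    return h - (mu / z) * lorenz(h)


z = 50.0  # hypothetical poverty line
for mu in (100.0, 110.0):  # second value: 10% growth, same Lorenz curve
    print(mu, round(headcount(z, mu), 4), round(poverty_gap(z, mu), 4))
```

With these hypothetical numbers, both measures fall when the mean rises, because the Lorenz curve is held fixed.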
To implement this approach in a fully
self-contained way for use on any PC, POVCAL has been programmed largely
from first principles. The source code is in Microsoft FORTRAN 5.0 and
was written by Shaohua Chen. POVCAL is designed to be a practical tool
for Bank staff; however, it is not a commercial program with a flash
user interface.
The program will run on any IBM
compatible PC, including the most basic model. The size of the data set
it can handle will depend on the memory of your PC, but 640K should be
ample for almost all applications. Its speed will depend entirely on the
speed of your computer.
You will need to have your grouped
distributional data set up in the way we describe in detail below
(though the program allows this to take many forms, as discussed below),
and you will need to know the poverty line. The program will estimate
the Lorenz curve, Gini index, headcount index of poverty, poverty gap
index, Foster-Greer-Thorbecke index, and the elasticities of these
poverty measures with respect to the mean of the distribution, and the
Gini index. It does all this for two alternative specifications of the
Lorenz curve - the General Quadratic (Villasenor and Arnold) and the
Beta model (Kakwani). (We have found these to be better than the many
alternatives in the literature.) It performs various checks on the
results, and it tells you which specification is better for your data.
The program is also set up to allow detailed poverty profiles to be
readily constructed; you simply re-run the program for each sub-group in
the poverty profile. (All the poverty measures are additively
decomposable; the aggregate poverty measure is simply the population
share weighted sum of the sub-group poverty measures as calculated by
POVCAL.) It also allows assessments to be easily made of the sensitivity
of the results to measurement assumptions, such as the choice of the
poverty line.
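The aggregation step for a poverty profile can be sketched as follows. This is only an illustration of the population-share weighted sum described above; the function name and the figures are hypothetical.

```python
def aggregate_poverty(subgroups):
    """Population-share weighted sum of sub-group poverty measures.

    subgroups: list of (population, poverty_measure) pairs, where the
    measure is the value obtained for that sub-group on its own.
    """
    total = sum(pop for pop, _ in subgroups)
    return sum((pop / total) * measure for pop, measure in subgroups)


# Hypothetical profile: (population, headcount index) for two sub-groups.
profile = [(30.0, 0.20), (70.0, 0.40)]
national = aggregate_poverty(profile)
print(round(national, 4))
```

The same function applies to any of the additively decomposable measures, provided each sub-group value has been computed with the same poverty line.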
SETTING UP YOUR DATA:
It is probably convenient to first set up a sub-directory (POVCAL,
say), and load all the files on the disk supplied into that
sub-directory (with "COPY A:*.* C:" if the disk is in the A
drive). Make sure that POVCAL.EXE and your input data file are in the
same sub-directory. POVCAL can be run only from this sub-directory.
Your data file is assumed to comprise
"records" and "sub-groups". The number of records is
simply the number of class-intervals or fractiles
in your data. For example, if your data are in the form of a table of
decile income shares, then you will have 10 records.
The number of sub-groups is the number of
ways the underlying population has been divided up in presenting the
distributions. For example, if it is divided up as "urban" and
"rural", then you will have two sub-groups. National data
comprise one sub-group.
Your data file must be set up in tabular
form where each row corresponds to a record, and the columns correspond
to the variables for the sub-groups; enter all variables for the first
sub-group, followed by those for the second etc. For each sub-group, the
columns must be in the same order as the variables specified in the
relevant option for your type of data, as explained below.
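The layout just described can be made concrete with a short sketch. It assumes that columns are separated by blanks and that every sub-group has the same number of variables; neither detail is stated above, and the function name and sample figures are hypothetical.

```python
def split_subgroups(text, n_subgroups, n_vars):
    """Parse blank-separated rows into one list of records per sub-group."""
    groups = [[] for _ in range(n_subgroups)]
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore empty lines
        values = [float(v) for v in fields]
        if len(values) != n_subgroups * n_vars:
            raise ValueError("unexpected number of columns: %r" % line)
        for g in range(n_subgroups):
            # All variables of sub-group g sit side by side in the row.
            groups[g].append(tuple(values[g * n_vars:(g + 1) * n_vars]))
    return groups


# Two records, two sub-groups ("urban" then "rural"), two variables each.
sample = """
10 2.5 10 3.5
10 4.0 10 4.5
"""
urban, rural = split_subgroups(sample, n_subgroups=2, n_vars=2)
print(urban)
print(rural)
```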
Distributional data can come in many
forms. The program allows eight possibilities, which should accommodate
all of the data found in practice. The program asks you to select one
option.
Each is defined by two or (sometimes)
three variables. The options are described in the following Table.
DATA OPTIONS
- Type 1: p=cumulative
proportion of population (ranked by the poverty indicator,
which we will call "income"), L=cumulative
proportion of income held by that proportion of the
population.
- Type 2: q=proportion of
population (as in p, but not cumulative), r=proportion of
income (as in L, but not cumulative).
- Type 3: p (as in 1), r (as in 2).
- Type 4: q (as in 2), L (as in 1).
- Type 5: f(x)=percentage of the
population in a given class interval of incomes, X=the mean
income of that class interval.
- Type 6: upper bound of a class
interval, f(x) (as in 5), X (as in 5).
- Type 7: upper bound of a class
interval, p (as in 1), X (as in 5).
- Type 8: upper bound of a class
interval, f(x) (as in 5).
NOTE: The program allows your
data to be expressed either as percentages or as
proportions.
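The data types listed above are related by simple transformations. The following is a minimal sketch, not the program's own code, of how type 2 and type 5 data could be turned into the cumulative (p, L) form of type 1. It assumes the records are ordered from poorest to richest; the function names and figures are hypothetical.

```python
from itertools import accumulate


def type2_to_type1(q, r):
    """Cumulate population shares q and income shares r into (p, L)."""
    q_total, r_total = sum(q), sum(r)  # 1 for proportions, 100 for percentages
    p = [c / q_total for c in accumulate(q)]
    L = [c / r_total for c in accumulate(r)]
    return p, L


def type5_to_type1(f, x):
    """Convert class population shares f and class mean incomes x into (p, L)."""
    # f * x is proportional to the total income held by each class.
    income = [fi * xi for fi, xi in zip(f, x)]
    return type2_to_type1(f, income)


# Hypothetical type 5 data: percentages of population and class means.
p, L = type5_to_type1([25, 50, 25], [40.0, 100.0, 160.0])
print(p)
print(L)
```

Dividing by the column totals is what makes the result the same whether the input is entered as percentages or as proportions.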
Data types 6, 7 and 8 include
information on the upper bounds of each class interval. In this
case, you will need to set an upper bound for the highest
(richest) class interval, though the choice is arbitrary, and
does not affect any of the calculations.
Data type 8 is potentially
troublesome, though (thankfully) it does not appear to be
common. This type does not include the mean of each class
interval. There is no option but to make assumptions about where
the mean lies within each class interval. Common practice is to
use the mid-points. In our experience (by using data sets for
which we do know the mean, but pretending that we do not, and
trying alternative assumptions), this is generally fine for all
but the lowest and highest class intervals. You will probably
get better results with the following rule of thumb which we
have built into the program: i) The mean of the lowest (poorest)
class interval is assumed to be 80% of the upper bound of that
class interval. ii) The mean of the highest class interval is
set at 30% above the lower bound of that class interval. iii)
For all other class intervals, the mean is set at the midpoint.
(The program re-writes your data file with these estimated
means, so in subsequent runs you can treat it as a type 6 data
set.) Needless to say, there will be some loss of accuracy relative to
using data sets for which you do know the means of each class
interval. This rule-of-thumb still gave quite accurate results
for the poverty measures in our experiments. If you want to try
alternative assumptions, then use option 6, adjusting the
(re-written) data file with your estimates of the means.
Also use type 8 when (although
you do not know the means for the class intervals) you do have
an estimate of the population mean. The program will prompt for
that information, and choose a mean for the richest class
interval consistent with your estimate of the overall mean.
(Otherwise, it will assign means by the above rule of thumb.) Of
course, this is also an arbitrary choice, and we would still
recommend you test the sensitivity of your results to these
assumptions.
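The rule of thumb for type 8 data can be sketched as below. The first function follows rules i) to iii) as stated above. The second is only one plausible reading of "consistent with your estimate of the overall mean" (it solves for the richest class mean so that the weighted mean matches); how POVCAL itself does this is an assumption here. Function names and figures are hypothetical.

```python
def class_means(upper_bounds):
    """Assign class means from upper bounds using the rule of thumb."""
    if len(upper_bounds) < 2:
        raise ValueError("need at least two class intervals")
    means = [0.8 * upper_bounds[0]]  # i) 80% of the lowest upper bound
    # iii) midpoints; each lower bound is the previous upper bound
    for lo, hi in zip(upper_bounds[:-2], upper_bounds[1:-1]):
        means.append(0.5 * (lo + hi))
    means.append(1.3 * upper_bounds[-2])  # ii) 30% above the top lower bound
    return means


def match_overall_mean(f, means, overall_mean):
    """Reset the richest class mean so the weighted mean equals overall_mean."""
    total = sum(f)
    rest = sum(fi * mi for fi, mi in zip(f[:-1], means[:-1]))
    adjusted = list(means)
    adjusted[-1] = (overall_mean * total - rest) / f[-1]
    return adjusted


bounds = [50.0, 100.0, 150.0, 200.0]  # the last bound is arbitrary
f = [20.0, 30.0, 30.0, 20.0]          # percentages of the population
means = class_means(bounds)
print(means)
print(match_overall_mean(f, means, overall_mean=110.0))
```

Note that the arbitrary upper bound of the richest class never enters the calculation, in line with the remark above that its choice does not affect the results.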
RUNNING POVCAL:
Type POVCAL and press ENTER.
The program will ask you for the following information:
1. The name of your ASCII data file.
2. The number of sub-groups.
3. The number of records.
4. The type of data you have (8 options).
5. The DOS name of your desired output file.
The program will then estimate the
General Quadratic (GQ) Lorenz curve and give you a statistical summary
of the results. After that it will ask you for the mean (if different
from that estimated from your data; you may, for example, want to test
for sensitivity to measurement error in the mean, assuming that the
Lorenz curve is unaffected), and the poverty line, which must lie in the
stipulated interval to give valid estimates.
The program will give you the Gini index
and poverty measures for this Lorenz curve. It will also give you the
elasticities of the three poverty measures with respect to the mean and
the Gini index. (The latter calculation assumes that the Lorenz curve
shifts equi-proportionately up or down at all points.)
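As an illustration of what an elasticity with respect to the mean measures, the sketch below estimates it numerically by finite differences, holding the Lorenz curve fixed. It uses a toy Lorenz curve, L(p) = p**2, for which the headcount index is z/(2*mu); this is not one of POVCAL's specifications, the function names are hypothetical, and the formulas the program actually relies on are those summarized in Datt's paper.

```python
def headcount(z, mu):
    """Headcount index for the toy curve L(p) = p**2, from L'(H) = z / mu."""
    return min(1.0, z / (2.0 * mu))


def mean_elasticity(measure, z, mu, step=1e-6):
    """Central-difference estimate of d ln(measure) / d ln(mu)."""
    up = measure(z, mu * (1.0 + step))
    down = measure(z, mu * (1.0 - step))
    return (up - down) / (2.0 * step * measure(z, mu))


eta = mean_elasticity(headcount, z=50.0, mu=100.0)
print(round(eta, 4))
```

For this toy curve the elasticity is close to -1: a one percent rise in the mean, with no change in the Lorenz curve, lowers the headcount index by about one percent.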
Next, all of the above will be repeated
for the Beta Lorenz curve.
Then, the program will give you an
assessment of which Lorenz curve (and corresponding estimates) you
should prefer for your data.
Finally, the program will graph the
fitted Lorenz curves, and their first and second derivatives.
CHOOSING BETWEEN THE TWO LORENZ CURVES:
Sometimes the choice is obvious, but other times some judgement is
needed. The program "builds in" what we consider to be sound
criteria for making such judgements.
The program will first check whether your
fitted models satisfy the theoretical conditions for valid Lorenz
curves. Four conditions should hold: i) its upper bound should be one,
ii) its lower bound should be zero, iii) it should be strictly
increasing throughout, and iv) its first derivative should be strictly
increasing throughout (convex from below). Some of these conditions hold
automatically for one or both specifications, while others have to be
tested for your data. The program does this and reports the result.
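The four conditions can be checked numerically on a grid, as in the minimal sketch below. It is not the program's own test: the grid size, the tolerance, and the function name are assumptions, and the two example curves are toy functions rather than the GQ or Beta specifications.

```python
def check_lorenz(L, dL, d2L, n=1000, tol=1e-9):
    """Check the four validity conditions on an evenly spaced interior grid."""
    grid = [i / n for i in range(1, n)]  # excludes the end points 0 and 1
    return {
        "upper bound is one": abs(L(1.0) - 1.0) <= tol,
        "lower bound is zero": abs(L(0.0)) <= tol,
        "increasing": all(dL(p) > 0.0 for p in grid),
        "convex": all(d2L(p) > 0.0 for p in grid),
    }


# A valid toy curve: L(p) = p**2.
good = check_lorenz(lambda p: p ** 2, lambda p: 2.0 * p, lambda p: 2.0)

# An invalid toy curve: L(p) = 3p**2 - 2p**3 is increasing but not convex.
bad = check_lorenz(
    lambda p: 3.0 * p ** 2 - 2.0 * p ** 3,
    lambda p: 6.0 * p - 6.0 * p ** 2,
    lambda p: 6.0 - 12.0 * p,
)
print(good)
print(bad)
```

The second curve meets both end-point conditions and is increasing, yet its second derivative turns negative above p = 0.5, which is the kind of local violation the graphs are meant to reveal.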
You can check all these conditions
yourself from the graphs for each Lorenz curve, and its first and second
derivatives, both of which should be positive (above the bold middle
line) throughout. (It does not matter if the second and third graphs
look strange - jagged - as long as they are above the middle horizontal
line.) This allows you to see precisely where any violations of the
conditions for a valid Lorenz curve are happening, and to assess the
extent of the problem.
Reliable estimates of the headcount index
of poverty are often possible as long as the Lorenz curve has positive
first and second derivatives in a neighborhood of the headcount index.
So if neither Lorenz curve is globally valid all is not lost, though one
should be wary of the estimated Gini index and the other poverty
measures.
Similarly, the first condition above -
that the upper bound of the Lorenz curve is exactly one - is not
essential for even quite accurate estimates of the poverty measures.
Serious violations of the theoretical
conditions are rare in our experience, and when they do happen it is
typically because the primary data have been grouped badly. When only
minor infringements occur, the estimates may still be quite good. You
have to make this judgment yourself, aided by the above comments and the
graphs given by POVCAL.
The program also assesses the
goodness-of-fit of the Lorenz curves. This can be done in two ways: i)
by comparing the sum of squared errors over the whole Lorenz curve, ii)
by comparing the sum of squared errors over the part of your Lorenz
curve up to the headcount index of poverty. Results are given for both,
but the second comparison is more appropriate for poverty measurement,
so that is the basis on which the program decides which Lorenz curve
fits the data better.
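The two goodness-of-fit comparisons can be sketched as follows. The data points, the two candidate curves, and the headcount value are hypothetical, and treating "up to the headcount index" as "all data points with p no greater than H" is an assumption about a detail not spelled out above.

```python
def sse(points, L, upto=1.0):
    """Sum of squared errors of a fitted curve L over points with p <= upto."""
    return sum((obs - L(p)) ** 2 for p, obs in points if p <= upto)


# Hypothetical grouped data as (p, L) pairs, and two toy fitted curves.
data = [(0.2, 0.05), (0.4, 0.17), (0.6, 0.36), (0.8, 0.63), (1.0, 1.0)]
curves = {"A": lambda p: p ** 2, "B": lambda p: p ** 1.5}
H = 0.4  # hypothetical headcount index

for name, curve in curves.items():
    print(name, sse(data, curve), sse(data, curve, upto=H))

# Prefer the curve that fits best over the range relevant for poverty.
better = min(curves, key=lambda name: sse(data, curves[name], upto=H))
print(better)
```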
AN EXAMPLE:
The disk also includes a trial data set, in the file INDIA.DAT. This
is the same data for rural India used in the example given in Gaurav
Datt's paper (referenced below; see his Table 1). As you will see, it
has 1 sub-group, 13 records, and it is a type 5 data set, entered as
percentages. The poverty line is Rs 89. By comparing the set-up of
INDIA.DAT with columns 2 and 3 in Table 1 of Datt's paper, then running
the program with this data set and comparing the results to those
reported in Table 4 of Datt's paper (for the General Quadratic Lorenz
curve) you should get a good idea of what is going on. Your results
should be the same as those in Datt's paper, allowing a little for
rounding errors. To double check, we also include a file INDIA.OUT which
is the output you should get using INDIA.DAT as the input. Looking
through this file in advance will also show you what POVCAL does (though
the graphs are only sent to the screen; you will need special software
to print them).
BACKGROUND READING:
The paper, "Computational Tools for Poverty Measurement and
Analysis", by Gaurav Datt summarizes the theoretical results being
used by the program, and gives further references to the theoretical
literature.
For background reading on the theory and
practice of poverty measurement, see "Poverty Comparisons: A Guide
to Concepts and Methods", by Martin Ravallion.
For an example of the use of these
methods in country poverty assessments see "Measuring Changes in
Poverty: A Methodological Case Study of Indonesia During an Adjustment
Period", by Martin Ravallion and
Monika Huppi, in the January 1991 issue of World Bank Economic Review.
FEEDBACK:
Let us know if you have any problems, or any suggestions for improving the
program.
Shaohua Chen, Martin Ravallion