A Statistical Package  by Ronald Ng, ASTAG, World Bank
This package currently contains a series of four programs designed for the nonstatistician undertaking monitoring and evaluation. Two of these programs allow the user to determine a minimum sample size for analysis of key parameters that have been identified as important for project implementation. The advantage of these programs is that they are participative. Sample size is determined not only on the basis of sound statistical calculations but also on the basis of information already available to the users about the parameters to be studied. Thus, the users determine sample size by interactively entering "what they know already" about the situation, so that a manageable, minimum sample size can be selected to test their working hypotheses. The approach is "dirty" because it relies upon the quality of the information that the user inputs. If the data are fairly reliable, the survey becomes less "dirty"; if the data are incorrect, the process is more "dirty". The statistical routines programmed into the package are, however, themselves very sophisticated and very sound.
The program is ideal as part of an iterative monitoring and evaluation process, whereby on the basis of an earlier rapid appraisal, users can develop a statistically valid sample for a survey that studies more formally the key parameters identified during the rapid appraisal. The results of the more formal survey can in turn be the basis for discussions with target beneficiaries regarding the problems and issues to be resolved by project intervention.
Both sampling programs use a cluster sampling methodology. The logic behind this choice of sampling method is quite simple. First, it is assumed that the reason for needing a "quick and dirty" design is that it is not practical to travel long distances identifying random elements for study; a cluster is easier to sample. Second, while a stratified sample is the most efficient choice in most cases, it is assumed that the information required to construct a proper stratified sample is not available to the users. Therefore the logical choice is a cluster sample. Option "C" is used for calculating the sample size to measure a quantitative parameter, such as "crop yield figures" or "household income", while Option "D" is used to calculate the sample size to survey a proportional parameter, such as "the proportion of households who are tenant farmers" or "the proportion of farmers who have adopted a new technology".
Based on the range of variation found in preliminary observations regarding a particular parameter, it becomes possible for the user to make an educated judgement of the optimal number of clusters that should be surveyed.
Instructions for Use of the Sampling Programs
In order to construct a sample size, the user should enter the information available from his or her own rapid appraisals of the project area or about the population in question. If there are several variables about which the user wants information from the same sample, each program can be run repeatedly using the same rapid appraisal data. The objective may be, for instance, to collect information on both "crop yield" and "income data". Option C should be run consecutively for each parameter, and the optimal sample size will be the larger of the two recommended samples. Similarly, if two or more proportions are to be measured, the user will "run" Option D for each proportion and sample the largest recommended number of clusters and respondents. In some cases, the user might wish to collect both quantitative and proportional information ("proportion of adopters" and "average income" or "average number of trees planted") in the same survey. Here, the user must make an educated judgement as to the optimal or practical sample size, based on the recommended sample sizes given by each of Options C and D.
For example, if the survey is designed to find out household income and maize yields with improved practices, the user will run Option C for each of these parameters and sample the larger of the two recommended sample sizes (both clusters and respondents). If the survey is designed to find out the percentage of adoption of improved practices, the percentage of farmers with irrigated land, and the percentage of farmers who are owner-tillers, then the user would run Option D for each of these variables and record the indicated sample size for each parameter. The results might be as follows:
Population: No. of Clusters: 125 villages
Size of Clusters: Average of 100 households/village

Parameter             Indicated Sample Size
Household income      8 clusters / 20 respondents per cluster
Maize yields          10 clusters / 10 respondents per cluster

Proportion            Indicated Sample Size
Improved Practices    20 clusters
Owner-Tillers         15 clusters
Irrigated Land        12 clusters
For the quantitative parameters, the sample size would be 10 clusters of 20 respondents per cluster. For the proportional parameters, it would be 20 clusters. If the user wants to get both kinds of information, then presumably, for practical reasons, some sacrifice of precision will be required for the "Improved Practices" proportion, since it is impractical to go to 20 clusters for five parameters when only one requires such a large number of clusters.
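The rule of taking the largest recommendation across parameters can be written down in a few lines. The sketch below is illustrative only: the function names and the dictionaries are hypothetical and are not part of the BASIC package; the figures are those of the example table above.

```python
# Illustrative sketch; these names are hypothetical, not part of the package.

def combine_quantitative(recommendations):
    """Return the largest cluster count and the largest number of
    respondents per cluster across all Option C parameters."""
    clusters = max(c for c, _ in recommendations.values())
    respondents = max(r for _, r in recommendations.values())
    return clusters, respondents


def combine_proportional(recommendations):
    """Return the largest cluster count across all Option D parameters."""
    return max(recommendations.values())


# Figures from the example: (clusters, respondents per cluster)
option_c = {"Household income": (8, 20), "Maize yields": (10, 10)}
# Figures from the example: clusters only
option_d = {"Improved Practices": 20, "Owner-Tillers": 15, "Irrigated Land": 12}

print(combine_quantitative(option_c))  # (10, 20)
print(combine_proportional(option_d))  # 20
```

The final choice between the Option C and Option D figures remains a judgement for the user, as described above; the sketch only reproduces the mechanical part.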
Each Option in the program can thus be run several times with different sets of available information. The users make their own decision, therefore, as to which sample they find most reasonable and reliable, knowing that the statistical calculations determining that sample are correct. This will be described in more detail for each Option.
One note. Since the program is written in BASIC, it is impossible to introduce some user-friendly features. In particular, the user cannot go back to the line above to retype an item of data that was mistyped. Instead, the user should enter all the rest of the data, keeping track of the error, and run the program as usual. At the end of the Option, the program will ask END PROGRAM (Y/N)? Answer N, and the user can then modify any of the existing information. Without changing either DESIGN PARAMETER or PRECISION LEVEL, the user should answer YES to changing the "PILOT INFORMATION". This provides a chance to go back, retype the observations correctly, and rerun the program.
Inputting the Data
Option C
The program first asks for information about the population/universe from which the sample is to be drawn. This will usually be in terms of villages, communities, administrative units, or districts. The program asks for the number of such units in the universe, e.g. No. of Clusters? 260
The program next asks for the average cluster size. For a village, this might be 45 or 145. For an administrative unit, this might be 500 or more. E.g. Av. Size of Clusters? 145
Next the user indicates the parameter to be studied, e.g. household income. The program asks the user to note the number of clusters for which information is available. If this was on the basis of a rapid appraisal, the number might be 4 to 10. E.g. No. of Clusters for which Information Available? 7
Next the user inputs the number of observations for the first cluster. Did he/she talk to 5 farmers in that village? Enter 5.
Next the computer asks for the income figures for each of those five farmers. Enter the income for each observation after the question marks.
Next the computer asks about the observations in the second cluster. Enter the number of observations and hit the <RETURN> key. Enter each of the observations at the appropriate question mark.
Continue this process until all the data is entered on all the observations. This program does NOT allow you to backtrack. If you enter incorrect data or miss a line, you must start this process over from the beginning. The computer program will NOT print out the results, so keep a written record of the figures you have entered.
Once the information has been entered, for example:
No. of Observations? 5
?25000 
?8900 
?34000 
?22000 
?12000 
No. of Observations? 6
?34000 
?8900 
?7000 
?12000 
?13000 
?8900 
And so on.
THEN hit the <RETURN> key again and the program will report the characteristics of this information (INTRACLASS CORRELATION, MEAN, STANDARD DEVIATION). The program will now ask you to enter the level of precision that you require for the results of your survey. Enter the desired precision level as a whole number (precision level is a statement about how closely you expect your data to cluster around the mean). Remember that the confidence level is presently set at 95% by the program itself:
Precision Level    Enter
.05                5
.1                 10
.2                 20
.01                1
You can change this precision level at the completion of the program without entering your data again and the program will adjust the needed size of your sample to fit this new precision level.
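The summary figures reported once the pilot data are entered (intraclass correlation, mean, standard deviation) can be checked by hand. The sketch below is a minimal illustration, not the program's own code: the manual does not say which estimator the package uses, so the standard one-way analysis-of-variance estimator of the intraclass correlation is assumed here, and the function name is hypothetical.

```python
from statistics import mean, stdev


def pilot_summary(clusters):
    """Return (mean, standard deviation, intraclass correlation) for pilot
    data given as a list of clusters, each a list of observations.

    ASSUMPTION: the intraclass correlation is the one-way ANOVA estimator;
    the package may use a different formula.
    """
    k = len(clusters)
    sizes = [len(c) for c in clusters]
    total = sum(sizes)
    if k < 2 or min(sizes) < 1 or total <= k:
        raise ValueError("need at least 2 clusters and more observations than clusters")

    values = [x for c in clusters for x in c]
    grand_mean = mean(values)
    sd = stdev(values)  # sample standard deviation of all observations

    cluster_means = [mean(c) for c in clusters]
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, cluster_means))
    ss_within = sum((x - m) ** 2 for c, m in zip(clusters, cluster_means) for x in c)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (total - k)

    # "Average" cluster size adjusted for unequal numbers of observations
    n0 = (total - sum(n * n for n in sizes) / total) / (k - 1)
    denom = ms_between + (n0 - 1) * ms_within
    icc = (ms_between - ms_within) / denom if denom > 0 else 0.0
    return grand_mean, sd, icc


# The two clusters of household income shown in the example above
pilot = [
    [25000, 8900, 34000, 22000, 12000],
    [34000, 8900, 7000, 12000, 13000, 8900],
]
m, s, rho = pilot_summary(pilot)
print(f"mean={m:.1f}  sd={s:.1f}  intraclass correlation={rho:.3f}")
```

An intraclass correlation near 0 means households within a village differ about as much as households in different villages; a value near 1 means households within a village are nearly alike, so more clusters are needed.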
It will then tell you the optimal sample size you need to test this parameter. The last step is to enter information about the cost of collecting information in time and logistics. This information will be used to allocate the required sample size to clusters. The price of a man-day is requested only as a service to the user; it is not used in the allocation of the sample to clusters, which depends only on the time cost of gathering the data.
Data: 
No. of days to list elements in cluster: 
Travelling time between clusters: 
Time required to locate elements (household, farmers) within a cluster: 
Time required to conduct survey: 
Unit cost per man-day for interviewer: 
This will result in an allocation of sample size to clusters on the basis of logistics involved in reaching the cluster and resulting cost of data collection within that cluster.
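To show how time costs can drive the split between number of clusters and respondents per cluster, here is a sketch using textbook cluster-sampling formulas. It is an assumption-laden illustration, not the package's algorithm, which the manual does not describe. Assumed here: precision is relative to the mean, the per-cluster cost is listing plus travel time, the per-respondent cost is locating plus interviewing time, and the optimal number of respondents per cluster is sqrt((cluster cost / respondent cost) x (1 - rho) / rho). All names are hypothetical.

```python
import math

Z_95 = 1.96  # normal deviate for the 95% confidence level fixed by the program


def allocate_sample(mean, sd, icc, precision_percent, avg_cluster_size, n_clusters,
                    list_days, travel_days, locate_days, survey_days,
                    cost_per_day=0.0):
    """Return (respondents per cluster, clusters, total days, total cost).

    ASSUMPTION: textbook formulas only; the package may allocate differently.
    """
    per_cluster = list_days + travel_days    # time spent once per cluster
    per_element = locate_days + survey_days  # time spent per respondent
    if mean <= 0 or per_element <= 0 or precision_percent <= 0:
        raise ValueError("mean, per-respondent time and precision must be positive")

    precision = precision_percent / 100.0    # Option C takes 5, 10, 20 ...
    rho = max(icc, 0.0)                       # treat negative estimates as zero

    # Sample size under simple random sampling for a relative precision
    n_srs = (Z_95 * (sd / mean) / precision) ** 2

    # Cost-optimal respondents per cluster, bounded by the cluster size
    if rho == 0.0:
        m = avg_cluster_size
    else:
        m = math.sqrt((per_cluster / per_element) * (1.0 - rho) / rho)
    m = int(min(max(round(m), 1), avg_cluster_size))

    design_effect = 1.0 + (m - 1) * rho
    clusters = min(math.ceil(n_srs * design_effect / m), n_clusters)

    total_days = clusters * (per_cluster + m * per_element)
    # The man-day price only prices the result; it does not affect allocation.
    return m, clusters, total_days, total_days * cost_per_day


print(allocate_sample(mean=100, sd=50, icc=0.2, precision_percent=10,
                      avg_cluster_size=145, n_clusters=260,
                      list_days=1.0, travel_days=1.0,
                      locate_days=0.25, survey_days=0.25, cost_per_day=10))
```

The sketch shows the general trade-off: when reaching a cluster is expensive relative to an interview, more respondents are taken per cluster; when households within a cluster are very alike (high intraclass correlation), fewer are taken per cluster and more clusters are visited.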
You have completed one parameter. You can adjust information for the same parameter (enter new observations, change the precision level desired, or keep the same observations but change the information about collection costs) and the program will make the adjustment. You can run this for the other quantitative parameters, or you can end the program and run "D" for the proportional design parameters included in your survey.
Option D: SAMPLING DESIGN FOR A PROPORTIONAL PARAMETER
A general note: The program in this option will provide the user with a suggested sample indicating the minimum number of clusters that should be sampled to get reliable information about a particular parameter. The program may recommend an impractical number of clusters. If this is the case, the user should experiment with lower levels of precision to see whether the result is more practical and whether the lower level of precision is acceptable to the end user of the sample results.
REMEMBER: You are looking at a proportion (a ratio), NOT a percentage. That is, 0.125 or 1/8 of the population are adopters, NOT 12.5% of the population are adopters. This affects how you enter your data: do not enter percentages, only proportions.
Procedure: Option D
The program first asks for information about the population/universe from which the sample is to be drawn. This will usually be in terms of villages, communities, administrative units, or districts. The program asks for the number of units in the universe, e.g. No. of Clusters? 260
The program next asks for information about the average cluster size. If a village is the sample unit, this might be 45 or 145 households. If an administrative unit, this might be 500 households or more. I.e. Average Size of Clusters? 145
Next the user indicates the parameter to be studied: i.e. proportion of farmers with private tenure, proportion of adopters of an improved technology, proportion of families experiencing a food shortage 6 months of the year or more, etc. This information is keyed in for the user's own purpose.
In response to this question, the computer program asks the user to note the number of clusters for which information is available. If this was on the basis of a rapid appraisal, the number might be 4 to 10. E.g. No. of clusters for which information is available? 7
Next the user inputs the known proportions of households with the parameter in question. The proportion should be entered as a decimal.
40% becomes .40 
35% becomes .35 
72% becomes .72 and so on. 
1? .40 
2? .35 
3? .72 
4? .55 
5? .30 
6? .65 
7? .44 
Once you have entered information on all the known clusters (in this case, 7 villages), hit the <RETURN> key and the program will calculate the statistical relationship between these indicative proportions.
The program then requests that you specify the degree of precision required for data analysis. In most cases, this will fall between 5% and 10%. For a .05 precision level (5% precision) enter .05; for a .10 precision level (10% precision) enter .10.
The program then tells the user how many clusters must be sampled in order to evaluate the particular parameter in mind. This Option D should be repeated for each qualitative (proportional) parameter to be studied. The maximum number of clusters indicated is the optimal sample size.
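As a rough cross-check on the Option D result, the number of clusters can be approximated from the spread of the pilot proportions. The sketch below is hedged accordingly: the manual does not give the program's formula, so it assumes a textbook calculation in which the cluster proportions are treated as observations, the precision is an absolute margin on the proportion, the confidence level is 95%, and a finite population correction is applied for the number of clusters in the universe. The function name is hypothetical.

```python
import math
from statistics import variance

Z_95 = 1.96  # 95% confidence level


def clusters_for_proportion(pilot_proportions, n_clusters, precision):
    """Approximate number of clusters needed to estimate a proportion.

    ASSUMPTION: textbook formula with finite population correction;
    the package may compute this differently.
    """
    if len(pilot_proportions) < 2:
        raise ValueError("need pilot proportions from at least 2 clusters")
    if any(p < 0 or p > 1 for p in pilot_proportions):
        # A value such as 40 almost certainly means a percentage was typed
        raise ValueError("enter proportions (0.40), not percentages (40)")

    between_var = variance(pilot_proportions)       # spread between clusters
    n0 = Z_95 ** 2 * between_var / precision ** 2    # ignoring population size
    n = n0 / (1 + n0 / n_clusters)                   # finite population correction
    return max(1, math.ceil(n))


# Pilot proportions and universe size from the example above
pilot = [0.40, 0.35, 0.72, 0.55, 0.30, 0.65, 0.44]
print(clusters_for_proportion(pilot, n_clusters=260, precision=0.05))  # 34
print(clusters_for_proportion(pilot, n_clusters=260, precision=0.10))  # 10
```

Under these assumptions, relaxing the precision from .05 to .10 cuts the number of clusters by more than two thirds, which illustrates why experimenting with lower precision levels is recommended when the first result is impractical.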