2 Summarizing Variables
Summarizing data is the process of distilling all of the observations down to a few numbers that are intended to describe the entire data set. Sometimes known as aggregating, this process yields values that can be used in both descriptive (univariate) and concomitant (multivariate) inferences.
Categorical variables are summarized by only a single statistic: the proportion (percent) of the observations in each category. The proportions must sum to 1.00 (100%). Proportions may be added together (e.g., the proportion of households with 0 bedrooms + the proportion of households with 1 bedroom = the proportion of households with fewer than 2 bedrooms). The category with the largest number of observations is termed the mode.
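As a rough illustration of the proportions described above, the following Python sketch computes the proportion in each category, checks that the proportions sum to 1.00, adds two of them together, and identifies the mode. The bedroom counts are made-up illustrative values, and the function name `category_proportions` is hypothetical rather than part of any library.

```python
from collections import Counter


def category_proportions(values):
    """Return a dict mapping each category to its share of the observations."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


# Illustrative (invented) data: number of bedrooms reported by 10 households.
bedrooms = ["0", "1", "1", "2", "2", "2", "3", "3", "2", "1"]

props = category_proportions(bedrooms)

# The proportions across all categories should add to 1.00 (100%).
total_proportion = sum(props.values())

# Proportions may be added: households with fewer than 2 bedrooms.
fewer_than_two = props["0"] + props["1"]

# The mode is the category with the largest number of observations.
mode = max(props, key=props.get)

print(props)
print(round(total_proportion, 2), round(fewer_than_two, 2), mode)
```

If two categories tie for the largest count, `max` returns only the first one it encounters, so a fuller treatment would report every tied category.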
Ordinal variables are generally summarized in the same manner as categorical variables (i.e., by the proportion of cases within each category). At times, the median (the point at which ½ the cases are larger and ½ are smaller) is used to show the middle of the group.
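One way to find the median of an ordinal variable is to convert each category to its rank, take the middle rank, and convert back. The sketch below assumes a hypothetical four-level rating scale and invented responses. It uses `statistics.median_low`, which always returns an actual data value, so the result is always one of the categories, even with an even number of cases.

```python
import statistics

# Hypothetical ordered scale, lowest to highest (an assumption for illustration).
LEVELS = ["poor", "fair", "good", "excellent"]


def ordinal_median(responses, levels):
    """Return the middle category of the responses, given the category order."""
    rank = {level: position for position, level in enumerate(levels)}
    ranks = [rank[response] for response in responses]
    # median_low picks the lower of the two middle ranks when the count is even.
    return levels[statistics.median_low(ranks)]


# Illustrative (invented) responses from 7 cases.
responses = ["good", "fair", "excellent", "good", "poor", "fair", "good"]

middle = ordinal_median(responses, LEVELS)
print(middle)
```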
Numeric variables are summarized by a number of statistics relating to the nature of the distribution of the values of the variable. These generally identify some aspect of the distribution’s central tendency, dispersion and shape.
- The measures of central tendency are, primarily, the mean (average) and the median.
- The usual measure of the dispersion of the observations/data (or how numerically spread out they are) is the standard deviation.
- The two measures that relate to the shape of the distribution of the observations are skew and kurtosis. Skew refers to the extent to which the observations tend to clump at either the high or low end of the range. Kurtosis refers, basically, to the proportion of observations distant from the mean (traditionally described as producing distributions with peaks that are either ‘flat’ or ‘sharp’).
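The measures listed above can be sketched in Python with the standard library. This is a minimal illustration, not a definitive implementation: it assumes the population (divide-by-n) forms of the standard deviation, skew, and kurtosis, and it reports kurtosis as excess kurtosis (0 for a normal distribution). Statistical packages often apply sample corrections and so may give slightly different values. The function name `describe` and the data are hypothetical.

```python
import statistics


def describe(values):
    """Summarize a numeric variable: central tendency, dispersion, and shape."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)

    # Third and fourth central moments, used for skew and kurtosis.
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n

    return {
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "skew": m3 / sd ** 3,                 # > 0: long tail toward high values
        "excess_kurtosis": m4 / sd ** 4 - 3,  # > 0: heavier tails than a normal
    }


# Illustrative (invented) data.
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

In this example the mean (5) lies above the median (4.5) and the skew is positive, which is consistent with most values clumping at the low end of the range while a few larger values stretch the high end.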