Introduction to Categorical Data Analysis


Scale of Measurement

Explanatory Variables vs. Response Variables

The Choice of Statistical Analysis


A categorical data research design is one which uses categorical scale measurement. Let's define categorical data and continuous data.

 Scale of Measurement

There are four types of scales that appear in social sciences: nominal, ordinal, interval, and ratio scales. They are categorized into two groups: categorical and continuous scale data. Nominal and ordinal scales are categorical data; interval and ratio scales are continuous data.

Categorical data having unordered categories are called nominal scales. "Name" is a good example of a nominal scale. Categorical data having ordered categories are called ordinal scales. Rank is an example of an ordinal scale. Continuous data having equal intervals but no absolute zero point are called interval scales. Continuous data having both equal intervals and an absolute zero point are called ratio scales.

The reason to distinguish between them is that the appropriate method of data analysis depends on the scale of measurement. Categorical data are analyzed with methods built for counts and proportions, such as chi-square tests, logistic regression models, and loglinear models. Continuous data are analyzed with methods such as the t-test, ANOVA, and regression.

 

Explanatory, Response, Control Variables

We have seen the differences between continuous scale data and categorical scale data. Let's now define explanatory, response, and control variables. Explanatory variables are also called independent variables, or X variables. Response variables are also called dependent variables, or Y variables. Control variables are called Z variables. The labels continuous and categorical describe the characteristics of the data. The labels explanatory, response, and control, on the other hand, describe the role of the data in the study.

The explanatory variable is one which is used as a predictor of the response variable. Statistical data analysis aims to reveal the effect of the explanatory variable on the response variable. In other words, how are responses influenced by the explanatory variable? For example, one might study how the effectiveness of a remedial mathematics program is influenced by factors such as the length of the program, the textbook, gender, classroom conditions, the characteristics of the instructor, and the method of instruction, which are all potential explanatory variables.

If the researcher wants to see whether a new instruction method influences the performance of the Remedial Mathematics Program participants, she can design a 2*2 experiment in the style of a clinical trial. In this case, the response variable is whether a student passes or fails the final examination; the explanatory variable is the method of instruction, with two levels: computer-aided instruction and tutoring. What she has to do in this study is count the number of students who pass or fail under each instruction method. The following contingency table shows how the information can be summarized in a 2*2 table.

<Table 1> Performance of Remedial Mathematics Program (RMP)(2 * 2 Table)

Instruction Method                  Pass   Fail   Total
Computer-Aided Instruction (CAI)      55     10      65
Tutoring                              32      3      35

The statistical analysis of the 2*2 contingency table relies on a chi-square test of independence. The null hypothesis is that the explanatory variable and the response variable are independent, that is, there is no relationship between them. The alternative hypothesis is that they are not independent, meaning that some relationship between them exists. If the p-value of the chi-square statistic is small, i.e., p < .05, it is concluded that there is an association between the instruction method and student performance in the RMP.
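
As an illustration, the chi-square test of independence can be sketched in a few lines of Python using the counts from <Table 1>. This is a minimal sketch, not a procedure taken from these notes: the function name chi_square_2x2 is made up for the example, the statistic is the plain Pearson chi-square without a continuity correction, and the p-value uses the closed form that holds only for one degree of freedom.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Hypothetical helper for illustration; no continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence: row total * column total / n
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: CAI, Tutoring. Columns: Pass, Fail (counts from Table 1).
table1 = [[55, 10], [32, 3]]
chi2, p_value = chi_square_2x2(table1)
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}")
```

For these counts the statistic comes out at roughly 0.93 with a p-value of about .33, so under this uncorrected test the null hypothesis of independence would not be rejected at the .05 level. Note that one expected count (Tutoring, Fail) falls slightly below 5, so an exact test may be worth considering in practice.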

<Table 1> is the simplest example in categorical data analysis, i.e., a 2 (instruction methods) * 2 (performance) design. It is possible to include another explanatory variable. For example, if the researcher wants to see the effect of school area in the above example, the research design becomes a 2 (instruction methods) * 2 (performance) * 2 (school area) design. In this case, the school area variable is called the control variable. Using a control variable is similar to using a "block design" or "stratified sampling". <Table 2> shows the 2*2*2 contingency table.

<Table 2> Performance of Remedial Mathematics Program (RMP) (2 * 2 * 2 Table)

School Area   Instruction Method   Pass   Fail   Total
Urban         CAI                    34      6      40
              Tutoring               18      2      20
Rural         CAI                    21      4      25
              Tutoring               14      1      15
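
To make the role of the control variable concrete, the following Python sketch stores the 2*2*2 counts by school area, computes the odds ratio of passing (CAI versus tutoring) within each area, and then collapses over area to recover the 2*2 counts. The dictionary layout and the helper name odds_ratio are assumptions made for this example only; the odds ratio is used here as one common summary of a 2*2 table and is not the analysis prescribed by these notes.

```python
# Counts from Table 2: school area -> instruction method -> (pass, fail)
table2 = {
    "Urban": {"CAI": (34, 6), "Tutoring": (18, 2)},
    "Rural": {"CAI": (21, 4), "Tutoring": (14, 1)},
}


def odds_ratio(cai, tutoring):
    """Odds of passing under CAI divided by odds of passing under tutoring."""
    (cai_pass, cai_fail), (tut_pass, tut_fail) = cai, tutoring
    return (cai_pass * tut_fail) / (cai_fail * tut_pass)


# Conditional odds ratios: one 2*2 table per level of the control variable.
conditional = {
    area: odds_ratio(methods["CAI"], methods["Tutoring"])
    for area, methods in table2.items()
}

# Marginal table: sum the pass and fail counts over school area.
marginal = {
    method: tuple(sum(table2[area][method][k] for area in table2) for k in range(2))
    for method in ("CAI", "Tutoring")
}
marginal_or = odds_ratio(marginal["CAI"], marginal["Tutoring"])

for area, value in conditional.items():
    print(f"{area}: odds ratio = {value:.3f}")
print(f"Collapsed over area: {marginal}, odds ratio = {marginal_or:.3f}")
```

Collapsing over school area gives back the pass and fail counts of the 2*2 table (55 and 10 for CAI, 32 and 3 for tutoring), which shows that the 2*2*2 design is the same data split by the control variable.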

Further data analysis procedure is explained in "The 2*2 Contingency Table" and "The 2*2*2 Contingency Table" sections.

 

The Choice of Statistical Analysis

Regression methods describe the relationship between the response variable and one or more explanatory variables. Regression methods are usually presented with continuous response (dependent, or Y) and explanatory (independent, or X) variables, and most statistical methods we have learned depend on continuous data. However, we sometimes have binary responses such as 'yes or no', 'male or female', or 'success or failure'. When responses are measured as binary data, they should be treated as categorical data, and the number of responses in each category should be counted. When explanatory variables are not continuous, e.g., dichotomous, dummy variables are used to distinguish the groups. This is called the regression approach to ANOVA. On the other hand, when the response variable is discrete, taking on two (binary) or more categorical values, the logistic regression model is considered.
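
The combination of a dummy-coded explanatory variable and a binary response can be sketched with the counts from <Table 1>. The example below assumes a coding of 1 for CAI and 0 for tutoring, which is an arbitrary choice made for this illustration. With a single binary predictor the logistic model is saturated, so its maximum likelihood estimates can be read off the sample proportions: the intercept is the log odds of passing in the reference group and the slope is the log odds ratio. The names dummy, logit, and predicted_pass_probability are hypothetical helpers.

```python
import math

# Counts from Table 1: instruction method -> (pass, fail)
counts = {"Tutoring": (32, 3), "CAI": (55, 10)}


def dummy(method):
    """Dummy code the explanatory variable: 1 = CAI, 0 = Tutoring (reference)."""
    return 1 if method == "CAI" else 0


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))


def pass_rate(method):
    """Observed proportion of students who passed under the given method."""
    passed, failed = counts[method]
    return passed / (passed + failed)


# Saturated logistic model: logit(P(pass)) = b0 + b1 * dummy(method)
b0 = logit(pass_rate("Tutoring"))       # log odds of passing, reference group
b1 = logit(pass_rate("CAI")) - b0       # log odds ratio, CAI vs. Tutoring


def predicted_pass_probability(method):
    """Inverse logit of the linear predictor."""
    eta = b0 + b1 * dummy(method)
    return 1 / (1 + math.exp(-eta))


print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, odds ratio = {math.exp(b1):.3f}")
for method in counts:
    print(method, round(predicted_pass_probability(method), 3))
```

Because the model is saturated, the predicted probabilities simply reproduce the observed pass rates; with more explanatory variables or continuous predictors the coefficients would have to be estimated iteratively by statistical software.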

The appropriate method of statistical data analysis should be chosen according to the characteristics of the data (continuous/categorical) and the role of the data (explanatory/response) in the research. <Table 3> gives a quick guideline for choosing a data analysis method based on the characteristics of the independent (explanatory) and dependent (response) variables.

<Table 3> The Types of Statistical Analysis

  




Copyright © 1997, Hee-Jae Cho