Logistic Regression Models and Loglinear Models


Introduction

Logistic Regression Model vs. Loglinear Model

Logistic Regression Models

Loglinear Models


Introduction

The analysis of r * c tables is primarily aimed at investigating statistical association by testing the null hypothesis of no association. In addition to testing for association, the strength of association is measured by the difference of proportions, the relative risk, and the odds ratio. This section introduces two modeling methods for categorical data.
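The three measures of association named above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original tutorial: the counts are the smoking-by-lung-cancer totals obtained by summing the grouped table later in this section over the three age groups, and the variable names are chosen here for illustration.

```python
# Smoking-by-cancer counts summed over age groups from the grouped table
# in this section, stored as (cancer yes, cancer no).
smokers = (71, 17)       # 15 + 30 + 26 and 4 + 7 + 6
nonsmokers = (37, 7)     # 8 + 14 + 15 and 2 + 2 + 3


def proportion(yes, no):
    """Proportion of 'yes' responses in one row of a 2 * 2 table."""
    return yes / (yes + no)


p1 = proportion(*smokers)
p2 = proportion(*nonsmokers)

# The three measures of association for a 2 * 2 table.
difference_of_proportions = p1 - p2
relative_risk = p1 / p2
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))

print(f"difference of proportions: {difference_of_proportions:.4f}")
print(f"relative risk:             {relative_risk:.4f}")
print(f"odds ratio:                {odds_ratio:.4f}")
```

A difference of 0, or a relative risk or odds ratio of 1, corresponds to no association between the two variables.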

 

Logistic Regression vs. Loglinear Model

Two statistical models for categorical data are introduced: the logistic regression model and the loglinear model. The choice between them depends on the roles and types of the variables. If the response variable is categorical and the explanatory variables are categorical and/or continuous, the logistic regression model should be used. If all variables are categorical and are treated as responses, with no distinction between response and explanatory variables, the loglinear model should be used. Table 3 gives a guideline for selecting a statistical analysis method depending on the characteristics of the data.

 

Logistic Regression

Logistic regression is a statistical modeling method that is used for categorical response variables. It describes the relationship between the categorical response variable and one or more continuous and/or categorical explanatory variables.
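For a binary response, the relationship is usually written on the log-odds (logit) scale. The formula below is the standard textbook form, given here as a sketch; the symbols are notation introduced for illustration and do not appear in the original text. Here \pi is the probability that the response equals 1, x_1, ..., x_k are the explanatory variables, and \beta_0, ..., \beta_k are the coefficients to be estimated.

```latex
\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
```

Under this form, exp(\beta_j) is the odds ratio associated with a one-unit increase in x_j, holding the other variables fixed.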

Logistic regression is used when the response variable is categorical and the explanatory variables are continuous, categorical, or both. For instance, suppose we are interested in the relationship between smoking and lung cancer. The explanatory variable is smoking status (smoking or nonsmoking), and the response variable is lung cancer status (cancer or no cancer). This gives a 2 * 2 case-control design, because the explanatory variable has two levels (smoking / nonsmoking) and the response variable has two levels (cancer / no cancer). If we are also interested in the role of age, we can add age as either a continuous or a categorical variable. It is easiest to start with the data matrix we would have in each case.

If age is a continuous explanatory variable, the data matrix looks like the following table. It is an ungrouped data set, with one row per person.

Age            Smoking           Cancer
(Continuous)   (Yes=1 / No=0)    (Yes=1 / No=0)
36             1                 0
47             0                 1
49             1                 0
29             1                 1
60             0                 1
55             1                 1
65             1                 0
38             1                 1
56             0                 1

On the other hand, if the age variable is categorized into three age groups (under 40, 41-60, and over 60), age becomes a categorical variable. In this case, it is possible to count the number of people in each cell of the contingency table. The following table summarizes the counts for all three categorical variables. It is a grouped data set.

                           Lung Cancer
Age Group    Smoking       Yes    No
Under 40     Smoking       15     4
41~60        Smoking       30     7
Over 60      Smoking       26     6
Under 40     No Smoking    8      2
41~60        No Smoking    14     2
Over 60      No Smoking    15     3

We call it the 2 (Smoking) * 2 (Lung Cancer) * 3 (Age Group) contingency table, because we have two levels of smoking, two levels of cancer, and three levels of age group.
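The step from an ungrouped data set to a grouped one is just binning and counting. The sketch below applies it to the nine ungrouped records shown earlier. It is an illustration only: the function name age_group is hypothetical, and placing age 40 itself in the first group is an assumption, since the text does not say where that boundary value belongs.

```python
from collections import Counter

# The nine ungrouped records from the table above: (age, smoking, cancer).
records = [
    (36, 1, 0), (47, 0, 1), (49, 1, 0),
    (29, 1, 1), (60, 0, 1), (55, 1, 1),
    (65, 1, 0), (38, 1, 1), (56, 0, 1),
]


def age_group(age):
    """Map a continuous age to one of the three age groups.

    Assumption: age 40 itself goes into the first group; the source
    does not specify where that boundary value belongs.
    """
    if age <= 40:
        return "Under 40"
    if age <= 60:
        return "41~60"
    return "Over 60"


# Count the people in each (age group, smoking, cancer) cell.
counts = Counter((age_group(age), smoking, cancer)
                 for age, smoking, cancer in records)

for cell in sorted(counts):
    print(cell, counts[cell])
```

Cells that no record falls into are simply absent from the counter, which returns 0 for them when looked up.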

The logistic regression model tests three things: whether smoking has an effect on lung cancer, whether age has an effect on lung cancer, and whether there is an interaction between smoking and age group. It then tries to find the best model for predicting the chance of lung cancer from the smoking and age variables.
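Before fitting any model, a descriptive look at the interaction question is to compute the smoking-by-cancer odds ratio separately within each age group of the grouped table. The sketch below does only that; the names table and stratum_or are chosen here for illustration, and it is not a substitute for the formal test the model provides.

```python
# Grouped table from this section: for each age group, the
# (cancer yes, cancer no) counts for smokers and nonsmokers.
table = {
    "Under 40": {"smoking": (15, 4), "no smoking": (8, 2)},
    "41~60":    {"smoking": (30, 7), "no smoking": (14, 2)},
    "Over 60":  {"smoking": (26, 6), "no smoking": (15, 3)},
}


def odds_ratio(exposed, unexposed):
    """Cross-product odds ratio for one 2 * 2 table."""
    a, b = exposed      # cancer yes / no among smokers
    c, d = unexposed    # cancer yes / no among nonsmokers
    return (a * d) / (b * c)


# One odds ratio per age group (stratum).
stratum_or = {age: odds_ratio(cells["smoking"], cells["no smoking"])
              for age, cells in table.items()}

for age, value in stratum_or.items():
    print(f"{age:10s} odds ratio = {value:.3f}")
```

Odds ratios that are roughly equal across age groups would be consistent with no smoking-by-age interaction; clearly different ones would suggest that the effect of smoking depends on age.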

In short, the logistic regression model is useful when the study concerns the relationship between a categorical response variable and categorical and/or continuous explanatory variables.
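As a rough sketch of what fitting such a model involves, the code below fits a main-effects logistic regression (smoking plus age group, no interaction) to the grouped table by Newton-Raphson, using only the Python standard library. The choice of this particular model, the dummy coding with "Under 40" and "No Smoking" as reference levels, and all function names are assumptions made for illustration; in practice a statistical package would be used.

```python
import math

# Grouped table: (age group, smoking 1/0, cancer yes, cancer no).
rows = [
    ("Under 40", 1, 15, 4), ("41~60", 1, 30, 7), ("Over 60", 1, 26, 6),
    ("Under 40", 0, 8, 2),  ("41~60", 0, 14, 2), ("Over 60", 0, 15, 3),
]


def design(age, smoke):
    """Dummy coding: intercept, smoking, age 41~60, age Over 60."""
    return [1.0, float(smoke), float(age == "41~60"), float(age == "Over 60")]


X = [design(age, smoke) for age, smoke, _, _ in rows]
y = [yes for _, _, yes, _ in rows]            # cancer cases per cell
n = [yes + no for _, _, yes, no in rows]      # people per cell


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [A[i][:] + [b[i]] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (M[i][k] - tail) / M[i][i]
    return x


def predict(beta, xi):
    """Fitted probability of cancer for one design row."""
    return 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))


def fit_logistic(X, y, n, iterations=25):
    """Maximum likelihood for grouped binomial data via Newton-Raphson."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        p = [predict(beta, xi) for xi in X]
        # Score vector and information matrix of the binomial log-likelihood.
        grad = [sum(xi[j] * (yi - ni * pi)
                    for xi, yi, ni, pi in zip(X, y, n, p)) for j in range(k)]
        info = [[sum(xi[a] * xi[b] * ni * pi * (1 - pi)
                     for xi, ni, pi in zip(X, n, p)) for b in range(k)]
                for a in range(k)]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    return beta


beta = fit_logistic(X, y, n)
fitted = [ni * predict(beta, xi) for xi, ni in zip(X, n)]

print("coefficients:", [round(b, 4) for b in beta])
print("smoking odds ratio adjusted for age:", round(math.exp(beta[1]), 4))
```

The exponentiated smoking coefficient is the odds ratio for smoking adjusted for age group. Testing the smoking-by-age interaction would mean adding two product columns to the design and comparing the two fits.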

 

Loglinear Models

Loglinear models are mostly used when at least two variables in a contingency table are response variables. They are useful for describing association patterns among a set of categorical response variables.
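For the simplest case of two categorical variables X and Y, the loglinear model is usually written in terms of the expected cell counts. The formula below is the standard textbook form, shown as a sketch; the symbols are notation introduced here and are not taken from the original text. Here m_{ij} is the expected count in cell (i, j), the single-superscript terms are main effects, and \lambda_{ij}^{XY} is the association term.

```latex
\log m_{ij} = \lambda + \lambda_i^{X} + \lambda_j^{Y} + \lambda_{ij}^{XY}
```

Setting every \lambda_{ij}^{XY} to zero gives the model of independence between X and Y, so testing that term is testing the association.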

For example, suppose we are interested in the relationships among coffee drinking, tea drinking, smoking, and gender. All four variables are responses: do you drink coffee every day? (C = yes/no), do you drink tea every day? (T = yes/no), do you smoke every day? (S = yes/no), and are you male or female? (G = male/female). When a study has at least two categorical response variables, loglinear models are particularly useful.

The 2 * 2 * 2 * 2 contingency table formed from the four variables is shown in the following table.

                              Coffee
Gender    Smoking    Tea      Yes    No
Female    Yes        Yes      15     5
Female    Yes        No       30     14
Female    No         Yes      17     8
Female    No         No       14     2
Male      Yes        Yes      23     6
Male      Yes        No       15     11
Male      No         Yes      18     7
Male      No         No       5      1

With the loglinear model, we can describe four main effects, six two-way associations, and four three-way associations. The six two-way associations are coffee*tea, coffee*smoking, coffee*gender, tea*smoking, tea*gender, and smoking*gender. The four three-way associations are coffee*tea*smoking, coffee*tea*gender, coffee*smoking*gender, and tea*smoking*gender. The saturated model also contains the single four-way association coffee*tea*smoking*gender. The loglinear model tests whether each association is significant in the model.
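The counts of terms follow from choosing subsets of the four variables, which can be checked with a short sketch. The variable and dictionary names below are chosen here for illustration.

```python
from itertools import combinations

variables = ["coffee", "tea", "smoking", "gender"]

# All terms of each order: 1 = main effects, 2 = two-way associations,
# 3 = three-way associations, 4 = the single four-way association.
terms = {order: ["*".join(combo) for combo in combinations(variables, order)]
         for order in range(1, len(variables) + 1)}

for order, names in terms.items():
    print(f"order {order}: {len(names)} term(s): {', '.join(names)}")
```

Together with the overall mean, these 15 terms make up the saturated model for the 16-cell table.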

In short, when there is a set of categorical response variables and no distinction between response and explanatory variables, the loglinear model provides a good statistical analysis for testing associations and interactions among those variables.
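As a minimal sketch of such a test, the code below fits the simplest loglinear model for the four-way table above, the mutual independence model with main effects only, and computes its likelihood-ratio statistic G^2. This model is chosen here purely for illustration because its fitted counts have a closed form; models with association terms generally need iterative fitting in a statistical package. All names in the code are assumptions.

```python
import math

# Four-way table from this section:
# (gender, smoking, tea, coffee yes count, coffee no count).
rows = [
    ("Female", "Yes", "Yes", 15, 5), ("Female", "Yes", "No", 30, 14),
    ("Female", "No", "Yes", 17, 8),  ("Female", "No", "No", 14, 2),
    ("Male", "Yes", "Yes", 23, 6),   ("Male", "Yes", "No", 15, 11),
    ("Male", "No", "Yes", 18, 7),    ("Male", "No", "No", 5, 1),
]

# counts[(gender, smoking, tea, coffee)] = observed cell count
counts = {}
for gender, smoking, tea, coffee_yes, coffee_no in rows:
    counts[(gender, smoking, tea, "Yes")] = coffee_yes
    counts[(gender, smoking, tea, "No")] = coffee_no

N = sum(counts.values())

# One-way marginal totals for each of the four variables.
margins = [{} for _ in range(4)]
for cell, count in counts.items():
    for i, level in enumerate(cell):
        margins[i][level] = margins[i].get(level, 0) + count

# Under mutual independence, the expected count is N times the product
# of the four marginal proportions.
expected = {cell: N * math.prod(margins[i][level] / N
                                for i, level in enumerate(cell))
            for cell in counts}

# Likelihood-ratio statistic and its degrees of freedom.
g2 = 2 * sum(obs * math.log(obs / expected[cell])
             for cell, obs in counts.items() if obs > 0)
df = len(counts) - 1 - sum(len(m) - 1 for m in margins)

print(f"N = {N}, G^2 = {g2:.3f} on {df} degrees of freedom")
```

A large G^2 relative to a chi-squared distribution with the stated degrees of freedom indicates that main effects alone do not describe the table, so some association terms are needed.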

 




Copyright © 1997, Hee-Jae Cho