What is an odds ratio in a logistic regression
Introduction
Logistic regression is used when our outcome variable is a categorical variable (or in most cases simply a variable with two states – yes or no). For many people who are new to the field of logistic regression the largest stumbling block for understanding this technique revolves around understanding what an odds ratio is, and how to discuss the confidence with which we can measure an odds ratio. This tutorial aims to address those questions.
To illustrate the concepts in this tutorial we shall consider a hypothetical country where we know each resident’s height, weight, age, and whether they are employed or not.
Linear regression
In the simplest version of linear regression we are interested in the relationship between two continuous variables (e.g. the height and weight of each person in our hypothetical country). We could then fit a regression line which might be given by “Weight = 2 * Height + 3”, with weight measured in kilograms and height in centimeters. In this regression line we see that a person who is one centimeter taller is predicted to be, on average, two kilograms heavier. This number “2” is referred to as our regression coefficient (where the value of 2 is our best estimate for the true value of this regression coefficient), and the uncertainty in our estimate for the regression coefficient is given by a term known as the standard error.
Logistic regression and Odds
In the linear regression our outcome variable was a continuous variable (e.g. weight). In logistic regression our outcome variable is a categorical or binary variable (e.g. employment), and we introduce a term called the “odds” of being employed. Suppose that our hypothetical country contains 2000 people who are employed and 1000 people who are not employed. In that case the proportion of people who are employed is 2000 / 3000, or about 67%. Equivalently, the probability that a person chosen at random will be employed is about 67%. The odds is the ratio between those people who are employed and those people who are not employed. In our case that ratio will have the value (2000 people employed / 1000 people not employed) = 2. The odds of being employed is 2 (or for every person unemployed we have 2 people who are employed). It is important to note that we should not use the terms “odds” and “probability” interchangeably. The odds of employment is equal to 2, while the probability of employment is 67%.
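The difference between odds and probability can be checked with a few lines of Python. This is only a minimal sketch using the counts from our hypothetical country; the function names `probability` and `odds` are my own labels for illustration and do not come from any library.

```python
# Counts from the hypothetical country described in the text.
employed = 2000
not_employed = 1000


def probability(yes: int, no: int) -> float:
    """Share of all people who fall in the 'yes' group."""
    return yes / (yes + no)


def odds(yes: int, no: int) -> float:
    """Number of 'yes' people for every one 'no' person."""
    return yes / no


p = probability(employed, not_employed)
o = odds(employed, not_employed)

print(f"probability of employment: {p:.3f}")  # 0.667
print(f"odds of employment: {o:.1f}")         # 2.0

# The two quantities can be converted into one another:
# odds = p / (1 - p) and p = odds / (1 + odds).
odds_from_p = p / (1 - p)
p_from_odds = o / (1 + o)
```

The last two lines show that each quantity can be recovered from the other, which is why they carry the same information even though the numbers differ.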
Odds ratio
Now let us start to consider what an odds ratio means. To begin with let’s look at the effect of our continuous variable (age) on the odds, and let’s draw up some values within the following table.
| Age (years old) | Number of people employed | Number of people unemployed | Probability of employment | Odds of employment |
| --- | --- | --- | --- | --- |
| 20 | 100 | 50 | 67% | 2 |
| 21 | 120 | 40 | 75% | 3 |
In this table we will just consider two ages (20 year olds and 21 year olds). At these different ages we can record how many people were employed and unemployed. We can determine the probability of employment (eg. 100 divided by 100 + 50 means that 67% of the 20 year olds are employed). We can also determine the odds of a 20 year old being employed, where 100 / 50 = 2, or 2 people are employed for every 1 person who is not employed.
We might then be interested in studying how the odds changes as a function of a person’s age. The odds of a 20 year old person being employed is 2, while the odds for a 21 year old person is 3. We can then introduce an odds ratio which talks about the change in the odds. The odds ratio is given by the odds for a 21 year old divided by the odds for a 20 year old, or 3 divided by 2, which equals 1.5. What this means is that as the age increases by one year, then the odds changes by a factor of 1.5 (or a 50% increase in the odds).
We then make the assumption within logistic regression that the odds changes by the same factor as we move from 21 year olds to 22 year olds. We know that the odds for 21 year olds is 3, and the odds ratio is 1.5. Hence the odds for 22 year olds is given by 3 multiplied by 1.5, which equals 4.5. For 23 year olds the odds is given by 4.5 multiplied by 1.5, which equals 6.75, and so on.
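The arithmetic in this section can be reproduced with a short Python sketch. It takes the counts for 20 and 21 year olds from the table, computes the odds ratio, and then projects the odds for later ages under the constant odds ratio assumption. The helper `projected_odds` is a hypothetical name introduced purely for this illustration.

```python
# Counts from the table in the text: age -> (employed, unemployed).
counts = {20: (100, 50), 21: (120, 40)}


def odds(yes: int, no: int) -> float:
    """Number of 'yes' people for every one 'no' person."""
    return yes / no


odds_20 = odds(*counts[20])     # 100 / 50 = 2.0
odds_21 = odds(*counts[21])     # 120 / 40 = 3.0
odds_ratio = odds_21 / odds_20  # 3.0 / 2.0 = 1.5


def projected_odds(age: int, base_age: int = 21,
                   base_odds: float = odds_21,
                   ratio: float = odds_ratio) -> float:
    """Odds at `age`, assuming every extra year multiplies the odds by `ratio`."""
    return base_odds * ratio ** (age - base_age)


for age in (21, 22, 23):
    print(age, projected_odds(age))
# 21 3.0
# 22 4.5
# 23 6.75
```

Note that the odds is multiplied by 1.5 each year rather than having 1.5 added to it, which is what it means for the odds ratio to stay constant.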
If we were to put this result into a paper then we might use statements such as “the logistic regression demonstrated that the odds increased by a factor of 1.5 as the age increased by one year”.
Odds ratios and regression coefficients
In the context of linear regression we obtained a regression coefficient (where for our example of height and weight above the regression coefficient had a value of 2). In the results for logistic regression we will see two numbers being reported, both a regression coefficient (which I shall denote as b) and an odds ratio. There is a simple relationship between these two terms where the odds ratio is given by the expression e^{b} (or e to the power of b). (Keep in mind that e is a constant with a fixed value of approximately 2.718.)
In the context of logistic regression a regression coefficient is difficult to interpret (eg. if you told me that you obtained a regression coefficient of 7 then that would have no practical meaning). A regression coefficient becomes useful when you convert it into an odds ratio. If you told me that you obtained an odds ratio of 1.5 then that has a practical meaning for me.
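To see how a regression coefficient and an odds ratio relate to each other, here is a small sketch using only Python's standard `math` module. The two function names are my own, chosen for illustration. It converts the odds ratio of 1.5 from our age example into the coefficient b that a logistic regression would report, and converts the coefficient of 7 mentioned above into an odds ratio.

```python
import math


def odds_ratio_from_coefficient(b: float) -> float:
    """Odds ratio = e to the power of b."""
    return math.exp(b)


def coefficient_from_odds_ratio(odds_ratio: float) -> float:
    """Inverse of the above: b = natural log of the odds ratio."""
    return math.log(odds_ratio)


# The odds ratio of 1.5 from the age example corresponds to b of about 0.405.
b = coefficient_from_odds_ratio(1.5)
print(round(b, 3))                               # 0.405
print(round(odds_ratio_from_coefficient(b), 3))  # 1.5

# A coefficient of 0 gives an odds ratio of 1 (the odds does not change).
print(odds_ratio_from_coefficient(0.0))          # 1.0

# A coefficient of 7 corresponds to a very large odds ratio.
print(round(odds_ratio_from_coefficient(7), 1))  # 1096.6
```

A positive coefficient always gives an odds ratio above 1 and a negative coefficient gives an odds ratio between 0 and 1.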
The uncertainty for our estimated odds ratio
In the context of linear regression we have an estimated value for our regression coefficient and a standard error. The standard error indicates how certain we are in our estimate for the regression coefficient. There is a statistical principle that tells us that, when our estimate is normally distributed, an interval reaching about two standard errors (more precisely 1.96) either side of our best estimate will capture the true value of the regression coefficient 95% of the time. Hence suppose that the best estimate for the slope of our regression line had a value of 9 and our standard error had a value of 2, then we can be 95% confident that the true value for the slope lies in the range between 5 and 13 (or 9 plus or minus 2 times 2). This range is known as the 95% confidence interval.
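The interval in this example is simple enough to compute by hand, but a short Python sketch makes the recipe explicit. The function name `confidence_interval` is hypothetical, and the default multiplier of 2 follows the rule of thumb used in the text. A multiplier of 1.96 is the more exact value under a normal approximation, which is an assumption of this sketch.

```python
def confidence_interval(estimate: float, standard_error: float,
                        multiplier: float = 2.0) -> tuple[float, float]:
    """Return (lower, upper) as estimate plus or minus multiplier * standard error."""
    half_width = multiplier * standard_error
    return estimate - half_width, estimate + half_width


# The example from the text: slope of 9 with a standard error of 2.
lower, upper = confidence_interval(9, 2)
print(lower, upper)  # 5.0 13.0

# The same interval with the more exact multiplier of 1.96 is slightly narrower.
lower_exact, upper_exact = confidence_interval(9, 2, multiplier=1.96)
```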
In the context of linear regression we could present our level of certainty by either reporting the standard error or by reporting the 95% confidence interval. We should note that standard errors are only meaningful if our estimate is normally distributed. In logistic regression the estimate for the regression coefficient b is approximately normally distributed, but the odds ratio is not. As a result we do not discuss the standard error for an odds ratio, but we can still discuss the 95% confidence interval for an odds ratio. That interval is obtained by first calculating the 95% confidence interval for the regression coefficient b, and then raising e to the power of each end of that interval.
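A common way to obtain the 95% confidence interval for an odds ratio is to build the interval for the regression coefficient b first and then exponentiate both ends. The sketch below illustrates this with Python's standard `math` module. The standard error of 0.1 is a made-up value chosen only for illustration, and the function name `odds_ratio_interval` is hypothetical.

```python
import math


def odds_ratio_interval(b: float, standard_error: float,
                        z: float = 1.96) -> tuple[float, float, float]:
    """Return (lower, estimate, upper) for the odds ratio.

    The interval is built on the coefficient scale, where the estimate is
    approximately normal, and then each end is converted with e to the power of x.
    """
    lower_b = b - z * standard_error
    upper_b = b + z * standard_error
    return math.exp(lower_b), math.exp(b), math.exp(upper_b)


b = math.log(1.5)      # coefficient matching the odds ratio of 1.5 in the text
standard_error = 0.1   # hypothetical value, not taken from the text

lower, estimate, upper = odds_ratio_interval(b, standard_error)
print(f"odds ratio {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The resulting interval is not symmetric around the odds ratio (the upper end sits further from the estimate than the lower end), which is one way of seeing why a single standard error is not reported for an odds ratio.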
My personal preference when writing a paper is to include three sets of results for each predictor variable in our logistic regression:
- The best estimate for the odds ratio
- The 95% confidence interval for the odds ratio
- The p-value for that odds ratio (where an odds ratio of 1 would indicate that the odds does not change for different values of the predictor variable).
Conclusion
In my experience the introduction of the term “odds ratio” is often a major stumbling block in researchers understanding the results from a logistic regression. I would advocate that if a researcher does get stuck at this point then they should draw up a table such as the one above with a different row for each value of our predictor variable X. The columns represent the number of yes’s and no’s for the categorical outcome variable, the probability of obtaining a yes, and the odds of obtaining a yes. An odds ratio then indicates the differences in the odds between the rows in this table. A subsequent 95% confidence interval then indicates how certain we are in our estimate for the odds ratio.