Correction to median unbiased estimation for the rate ratio

Epitools’ rateratio command uses a median unbiased estimator (MUE). MUEs are described in Rothman, Greenland, & Lash’s Modern Epidemiology (3rd edition, also in earlier editions). Some statistical background: the MUE considers a binomial model for the sampling fraction of cases, taking as fixed or given the incidence ratio in unexposed individuals and the person-year exposure in both exposed and unexposed individuals. In this case the sufficient statistic in the number of exposed cases y. The MUE for the sampling fraction of exposed cases is the probability p so that Pr(Y ≤ y) = Pr(Y ≥ y). When Y is all cases, p is 1 which corresponds to a rate ratio of ∞. The trouble arises when calculating 90% confidence intervals for the MUE. An alternative formulation gives a 95% upper-bound for p by considering a pu so that Pr(Y ≤ y) ≥ 0.95 when the sampling fraction of cases is pu. When a range of values for pu satisfy that condition, we can choose the smallest such pu. This modified definition of an upper bound matches an intuitive understanding that if the RR estimate is ∞, its upper bound must also be ∞.

In Epitools version 0.5-9 and earlier, the call to uniroot from rateratio.midp did not handle boundary cases in this fashion. The nature of the warning was that opposite ends of the midp probability function did not yield results of opposite sign. To address this, simple case handling for this probability root finding algorithm has been included. This permits MUEs of possibly infinite or 0 risk ratios for which 95% confidence intervals and p-values can be reported.




probratio: converting ORs to RRs

It’s no secret that odds ratios (ORs) cannot be reported as risk ratios (RRs). But it bears repeating: odds ratios cannot be reported as risk ratios. It is perhaps the most common fallacy of reporting logistic regression models. At best, the OR is a biased estimator of the RR. A little bias is okay, but how little needs to be described. OR-RR bias is loosely believed to be “less bad” when the prevalence of the outcome is low in the unexposed population. Low is debatable: 10% has been asserted, but bias can be quite bad even at this prevalence. For a continuous exposure, everyone is a “little bit” exposed. This led to Zhang & Yu’s 1989 JAMA correction which, unfortunately, was too optimistic (anti-conservative) in 95% CIs or inference.

Modern regression pre-empts biased analyses in most cases: A great discussion on the topic is found in Lumley, et al. This paper also discusses estimation of RR components using available software. Of the proposed solutions, direct relative risk estimation with a binomial GLM and log link is the most appropriate. Poisson regression could be believed to be unbiased since it also models a log link, and robust error estimation averts any variance structure assumptions. However, Poisson allows the predicted risk to soar above 1 which would normally have been truncated at 1 in other regression: so a rare event (or similar) assumption crops up yet again.

What if we thought beyond “the model”? If the data are population representative, then the predicted values from logistic models are indeed the risk of the event in the population (under mean model assumptions). Muller and MacLehose discuss using marginal standardization to take these predicted values and simply predict the risk in people who are exposed and unexposed , average it up, and take the ratio. It’s (arguably) consistent with a Pearl-esque counterfactual understanding of causality.

The Stata code are out there to do this. R required handiwork. Since the math and programming were simple implementations of matrix algebra and some calculus, I wrote up code to replicate the Stata procedure and have put it on CRAN today.

probratio takes, as arguments, “object” a GLM object of family binomial, a method for converting your ORs to RRs, and a scale on which the inference in performed, among other things. The method is one of three ways: maximum likelihood (ML), delta-method, or bootstrap. ML is just refitting the GLM using relative risk regression, which in R is just as easily done using glm(<your old formula>, data=<your old data>, family=binomial(link ='log')

Note the latter bit. Of course, R explodes when doing this. And it does so often because the initial estimation procedure: naive linear regression to obtain starting coefficients. This ofte. Examples of fixing this are supplied in the examples. The delta-method and bootstrap specifically implements the approach advocated by Muller and MacLehose.

Some caveats to be aware of:

  1. The existing method does not support interactions between parameters. If there is a product term, the assumption that we “manipulate exposure” would also affect the product terms, but the current implementation does not do this.
  2. Relative Risk GLMs are very prone to domination. I’ve included in the examples and warnings indicators for how to improve convergence. However, not all problems will be solvable.
  3. Bootstrap is not bias corrected. I’m not even sure whether the RR is a pivotal quantity in this case. There are other ways of calculating CIs which account for the bias, and they are no less efficient that the biased methods, so using the proper bootstrap implementation seems the right way to go about things.
  4. I haven’t explored weighting too thoroughly. Scott & Wild talk about how this can be done with case-dependent sampling to produce population consistent predictions of risk and their variance. This is assuming, of course, the weights do not have variability or at best are independent of the data used in analysis. For two-phase estimation, well that’s way beyond me.

I’m very curious about this method and there are several avenues of methods exploration:

  1. What happens with model misspecification? I’m thinking about omitted covariates. The predicted risk is not necessarily the right risk due to non-collapsibility of the OR.
  2. Are there conditional likelihood extensions? If we have matched sets and conditional likelihood expressions, can a plug-in estimator of the risk in the matched set serve to adequately predict risk in participants?
  3. Is the RR actually pivotal?

About the author(s)

epitools is authored by everyone. R is a free as in speech and beer, and all packages stemming from it are as well. They have the same “copyleft” agreement. A copyleft agreement means that intellectual property is not to be horded, but shared. This means that you, the user, are also the tester and developer. Please contact the package maintainers frequently and often with questions and ideas.

I (Adam) am the epitools maintainer. The R description summarizes this role perfectly: I’m the one you complain to when something doesn’t work. I hope the nature of the conversation can be mostly focused on the package itself, but I will try my best at empathetic listening if “[the thing that] doesn’t work” is defunding, data disasters, or collaborator tensions.

I am a biostatistician by training, working in public health research at Washington State University’s College of Community Medicine (soon to be subsumed by their medical school). WSU is a landgrant university. Their new medical school has the distinct objective of providing medical service to the surrounding rural regions, including members of Latino and American Indian descent, people without access to health care, and people faced with distinct public health problems–which are actually common but just underrepresented in the majority of urban based research. My interest in programming for community medicine should be apparent: we broach methods which are scant in the broader literature: pragmatic trials, group randomized trials, behavioral interventions, and so on.

Tomás J. Aragón is the epitools author. He recently relinquished his role as maintainer to me; he has more than served his time in being complained to. As an author, he created the package as a concept and a tool from scratch. The original package was largely based out of Kenneth Rothman’s “Epidemiology: an Introduction” text, chapter 4, which is an excellent modern text on Epidemiology. Many of the examples will look familiar to those versed in the epitools, and vice versa.

Other contributors have offered code packages over time and are credited in the package and functions to which they did so. I welcome anyone who may propose ideas for code modifications which are reviewed and accepted to be a contributor.


Welcome to epitools

Hello epidemiologists, biostatisticians, and generally concerned members of the Community:

Together, we comprise the “public” of public health. So, public health research and the recommendations stemming from it implicate us. The public is increasingly aware of bad research and its impact. Part–but not all–of these problems deal with data and analysis. Charged by this, methods epidemiologists and statisticians cobbled together tools to remedy bad research. But how rapid has their acquisition been? Remarkably slow. Dissemination and technology has been part of the difficulty: people with data didn’t know there were tools available to solve their problems, or even if they did, they didn’t know how to use them. epitools is an R package meant to address this limitation.

This accompanying blog serves two roles for you, the userbase: a live alert feed to keep you aware of newly published tools and a way to engage you in a dialogue about the tools and data analysis. To be “aware” of tools means more than simply knowing they are there. To the extent possible, the R documentation should serve as a self-learning tool with descriptions and citations accessible to persons with a non-mathematical background. To compensate for any shortcomings, this blog will discuss the functions, their usage, and the inspiring literature. This blog will also be a needs assessment tool to discuss as yet unimplemented approaches, or methods of improving existing implementations.

Lastly, thank you for your interest in epitools. I truly believe that computational statistics are a form of service to the community. They can be likened to formula 1 cars, whose tuning is specified by you, the driver, but supplied by me, the engineer, and the victory cup is shared. Please cite epitools in any paper using these tools using the command cite(‘epitools’).