Examining binomial regression of events/trials and overdispersion

Melanie Wall

2023-06-24

knitr::opts_chunk$set(echo = TRUE)

This document shows different ways to code/model binomial count (event/trials) data in r glm() packange and also demonstrates the results when we aggregate data across units and also when we account for overdispersion across units.

The data for demonstration are about Bird Extinctions from the R package Sleuth3. To get it: install.packages(“Sleuth3”). Data (shown below) are from 18 islands (Krunnit Islands in Finland) and are counts of how many different bird species there are, i.e. “trials” (AtRisk) and how many species went extinct, i.e. “events” (Extinct)

For illustration, I created a dichotomous predictor variable called group which is 0 for big islands and 1 for small islands

library(Sleuth3)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──

## dplyr 1.1.2 readr 2.1.4

## forcats 1.0.0 stringr 1.5.0

## ggplot2 3.4.2 tibble 3.2.1

## lubridate 1.9.2 tidyr 1.3.0

## purrr 1.0.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

## dplyr::filter() masks stats::filter()

## dplyr::lag() masks stats::lag()

## Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

krunnit <- case2101


####create a grouping variable for demonstration

krunnit$group = if_else(krunnit$Area < 10,1,0)


# > krunnit

# Island Area AtRisk Extinct group

# 1 Ulkokrunni 185.80 75 5 0

# 2 Maakrunni 105.80 67 3 0

# 3 Ristikari 30.70 66 10 0

# 4 Isonkivenletto 8.50 51 6 1

# 5 Hietakraasukka 4.80 28 3 1

# 6 Kraasukka 4.50 20 4 1

# 7 Lansiletto 4.30 43 8 1

# 8 Pihlajakari 3.60 31 3 1

# 9 Tyni 2.60 28 5 1

# 10 Tasasenletto 1.70 32 6 1

# 11 Raiska 1.20 30 8 1

# 12 Pohjanletto 0.70 20 2 1

# 13 Toro 0.70 31 9 1

# 14 Luusiletto 0.60 16 5 1

# 15 Vatunginletto 0.40 15 7 1

# 16 Vatunginnokka 0.30 33 8 1

# 17 Tiirakari 0.20 40 13 1

# 18 Ristikarenletto 0.07 6 3 1

Two ways of coding a binomial in glm:
1) proportion (events/trials) as outcome and weights = trials,
2) cbind( events, trials – events) as bivariate outcome and no need to include weights
See below

krunnit$prop = krunnit$Extinct/krunnit$AtRisk


###uses proportion as outcome with weight = number of trials (“AtRisk”); n=18 islands

fit1<-glm(prop ~ group, family = binomial(), weights = AtRisk, data = krunnit)


###uses number of events (“Extinct”) and number of non-events (“AtRisk – Extinct”) as bivariate outcome (using cbind); n=18 islands

fit2<-glm(cbind(Extinct, AtRisk – Extinct) ~ group, family = binomial(), data = krunnit)

What about if we aggregate events across islands so that the data is summarized just by group?

###aggregates events and non-events across islands


aggkrunnit<-krunnit %>% select(AtRisk, Extinct, group) %>% group_by(group)%>%

summarise(totAtRisk=sum(AtRisk),

totExtinct = sum(Extinct))%>%

mutate(totprop = totExtinct/totAtRisk)

aggkrunnit

## # A tibble: 2 × 4

## group totAtRisk totExtinct totprop

## <dbl> <int> <int> <dbl>

## 1 0 208 18 0.0865

## 2 1 424 90 0.212

###uses aggregated data and proportion as outcome with weight = number of trials (“AtRisk”) n=2 groups

fit3<-glm(totprop ~ group, family = binomial(), weights = totAtRisk, data = aggkrunnit)


###uses aggregated data number of events (“Extinct”) and number of non-events (“AtRisk – Extinct”) as outcome n=2 groups

fit4<-glm(cbind(totExtinct, totAtRisk-totExtinct) ~ group, family = binomial(), data = aggkrunnit)



summary(fit1)

##

## Call:

## glm(formula = prop ~ group, family = binomial(), data = krunnit,

## weights = AtRisk)

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -2.3567 0.2466 -9.556 < 2e-16 ***

## group 1.0453 0.2737 3.819 0.000134 ***

## —

## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 45.338 on 17 degrees of freedom

## Residual deviance: 28.185 on 16 degrees of freedom

## AIC: 91.517

##

## Number of Fisher Scoring iterations: 4

summary(fit2)

##

## Call:

## glm(formula = cbind(Extinct, AtRisk – Extinct) ~ group, family = binomial(),

## data = krunnit)

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -2.3567 0.2466 -9.556 < 2e-16 ***

## group 1.0453 0.2737 3.819 0.000134 ***

## —

## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 45.338 on 17 degrees of freedom

## Residual deviance: 28.185 on 16 degrees of freedom

## AIC: 91.517

##

## Number of Fisher Scoring iterations: 4

summary(fit3)

##

## Call:

## glm(formula = totprop ~ group, family = binomial(), data = aggkrunnit,

## weights = totAtRisk)

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -2.3567 0.2466 -9.556 < 2e-16 ***

## group 1.0453 0.2737 3.819 0.000134 ***

## —

## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 1.7153e+01 on 1 degrees of freedom

## Residual deviance: -2.6201e-14 on 0 degrees of freedom

## AIC: 14.748

##

## Number of Fisher Scoring iterations: 3

summary(fit4)

##

## Call:

## glm(formula = cbind(totExtinct, totAtRisk – totExtinct) ~ group,

## family = binomial(), data = aggkrunnit)

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -2.3567 0.2466 -9.556 < 2e-16 ***

## group 1.0453 0.2737 3.819 0.000134 ***

## —

## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 1.7153e+01 on 1 degrees of freedom

## Residual deviance: -2.6201e-14 on 0 degrees of freedom

## AIC: 14.748

##

## Number of Fisher Scoring iterations: 3

Note that the results from all 4 ways of modeling above (fit1, fit2, fit3, fit4) give the EXACT same coefficients and EXACT same standard error for group. That is, it doesn’t matter whether we use the coding method with “prop” or the “cbind(event,trial-event)” as outcome. And, more importantly and interestingly, it doesn’t make a difference whether we use all 18 sets of counts (1 for each island) separately (like in fit1 and fit2) or else include only the aggregated data across islands by group (n=2) (like in fit3 and fit4).

The fact that the standard errors are the same whether we aggregate or not pointed to something important about what the binomial model is assuming. The reason the disaggregated and aggregated are the same is that the binomial model has only 1 parameter (i.e. the probability of an event) and the regression model is actually assuming all the trials are independent across all islands. The model doesn’t care/notice that the data come from different islands. It is focused on just modeling the probability of an event out of all the possible trials aggregating across big and small islands.

But, this is a problem because we know the species at risk are clustered within islands and so there is another variability which should probably be considered which is island-island variability. This extra variability due to clustering in binomial data (and also in Poisson data) is often called “overdispersion”. There are multiple ways to incorporate it (See this paper https://journals.sagepub.com/doi/pdf/10.1177/0962280215588569 ), but the simplest is to adjust the standard errors by a scale factor either using the Pearson or the Deviance residuals which quantifies how much the variabilty in the residuals deviates from that expected if all islands had the same exact true probability of an event (after accounting for covariates).

Notice in the binomial output it says “(Dispersion parameter for binomial family taken to be 1)”. We can change that by using family = quasibinomial() as below:

##### Using quasibinomial() to account for island to island variability in proportions


fit1disp<-glm(prop ~ group, family = quasibinomial(), weights = AtRisk, data = krunnit)

fit2disp<-glm(cbind(Extinct, AtRisk-Extinct) ~ group, family = quasibinomial(), data = krunnit)


summary(fit1disp)

##

## Call:

## glm(formula = prop ~ group, family = quasibinomial(), data = krunnit,

## weights = AtRisk)

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) -2.3567 0.3324 -7.091 2.56e-06 ***

## group 1.0453 0.3689 2.834 0.012 *

## —

## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

##

## (Dispersion parameter for quasibinomial family taken to be 1.816271)

##

## Null deviance: 45.338 on 17 degrees of freedom

## Residual deviance: 28.185 on 16 degrees of freedom

## AIC: NA

##

## Number of Fisher Scoring iterations: 4

summary(fit2disp)

##

## Call:

## glm(formula = cbind(Extinct, AtRisk – Extinct) ~ group, family = quasibinomial(),

## data = krunnit)

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) -2.3567 0.3324 -7.091 2.56e-06 ***

## group 1.0453 0.3689 2.834 0.012 *

## —

## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

##

## (Dispersion parameter for quasibinomial family taken to be 1.816271)

##

## Null deviance: 45.338 on 17 degrees of freedom

## Residual deviance: 28.185 on 16 degrees of freedom

## AIC: NA

##

## Number of Fisher Scoring iterations: 4

The dispersion parameter is 1.816271 above. Note that the coefficient estimates for the effects for group are the same as when we just used family = binomial earlier, but now with the family = quasibinomial, the STANDARD ERRORS and consequently P-VALUES are different (bigger). Specifically, the group coefficient (SE) = 0.3689 now whereas before it was 0.2737 without overdispersion. We can see that: 0.3689 = sqrt(1.816271)*.2737. That is, the standard errors are now inflated by sqrt of the overdispersion parameter. CONCLUSION, If we want to test the group effect on the probability of a species going extinct, it is more accurate to account for the fact that species are clustered within islands and not all islands have the same probability of bird species going extinct, hence it is better to adjust/inflate the standard errors for “overdispersion”.

And, if we try (below) to use quasibinomial with the aggregated data, as expected it will not work because there is not any residuals that can be used to get the estimate for the overdispersion, i.e. there is no way to check if there is any variabilty at the level island level when we collapse across islands

fit3disp<-glm(totprop ~ group, family = quasibinomial(), weights = totAtRisk, data = aggkrunnit)

summary(fit3disp)

##

## Call:

## glm(formula = totprop ~ group, family = quasibinomial(), data = aggkrunnit,

## weights = totAtRisk)

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) -2.357 NaN NaN NaN

## group 1.045 NaN NaN NaN

##

## (Dispersion parameter for quasibinomial family taken to be NaN)

##

## Null deviance: 1.7153e+01 on 1 degrees of freedom

## Residual deviance: -2.6201e-14 on 0 degrees of freedom

## AIC: NA

##

## Number of Fisher Scoring iterations: 3

Posted in Uncategorized | 1 Comment

Data scientists, myself included, need regular reminders about problems of interpreting odds ratios from logistic regression

I had a collaborative paper reviewed this week that had in it odds ratios (OR) from logistic regressions (used to examine whether one way of measuring norms was a stronger predictor of a dichotomous behavior than another measure of the norm).  A reviewer said “The comparison of odds ratios across logistics regression models has been criticized as this approach might be subject to an unobserved heterogeneity bias (C Mood 2010, European Sociological Review). In order to solve this issue, the authors may want to consider the use of Average Marginal Effects as it is suggested in the study by Mood 2010.”   The review is right, and I should have caught this before signing-off on the analyses in the paper.  That pesky non-linear logistic function (the s-shape curve for the probabilities) creates problems of interpretation of coefficients from logistic regression, especially when we want to compare them to one another (e.g. when looking for interactions, when looking at sequential models like in mediation, or when simply comparing strengths of association between predictors as in the collaborative paper mentioned above).   The interpretational problem emphasized by the Mood paper is the following: that in logistic regression the OR for one variable (X1) can change when we control for another variable (X2) even when X1 and X2 are not correlatedThe same is not true in linear regression.  In linear regression if X1 and X2 are uncorrelated then the beta for Y regressed on X1 will be the same whether we control for X2 or not.  This phenomena in logistic regression (due to so-called ‘unobserved heterogeneity bias’) can create problems when you are trying to interpret differences in coefficients (i.e differences in ORs).

One straightforward way to avoid (or at least limit) these problems with using OR from logistic regression is to switch to using the Average Marginal Effects which transform back to the probability (rather than the log odds) scale and aggregate (i.e. marginalize) the effect estimates on the probability scale.  You still fit a logistic (or log link) model but you don’t use the coefficients directly as effect estimates (i.e. you don’t use exp(beta) = OR), instead you use expected contrasts formed back on the probability scale.  This can be done by the MARGINS command in commonly used software (in Stata, or SAS has a macro, or R package).

Another way to avoid these problem with using OR from logistic regression is to just  use linear regression with the 0-1 outcome.  I know, I know, if you have any experience doing data analysis (especially if you have a quantitative degree and took statistics classes) you may be saying, “enough is enough, a 0-1 variable *can’t* be fit with a linear regression model, it is clearly not continuous,  it is clearly not normally distributed, the predictions will be outside the parameter space (0,1),  and the relationship can’t be a straight line – all the assumptions of linear regression are wrong”.   By using linear regression you are modeling the probabilities of Y=1 directly.  The advantage of the linear regression is that it avoids the problem of the unobserved heterogeneity bias that creeps into logistic regression and we can interpret the effect estimates as the average risk difference in the outcome for a one unit change in the predictor.  Recall that linear regression is modeling the E(Y) (i.e. expected or mean value of Y).  When Y is a 0-1 variable then E(Y)  is just the probability of Y=1.  Indeed, the beta from the linear regression of a 0-1 outcome is a consistent (i.e. asymptotically unbiased) estimator of the Average Marginal Effect.  Standard errors need a little care since the typical ones from linear regression rely on the normality assumption, but you can use robust standard errors or else use a linear binomial model.   Economists have been using linear regression for 0-1 outcomes for a long time, see https://statisticalhorizons.com/linear-vs-logistic.

Finally,  if you are a regular user of logistic regression, I want to give a strong recommendation to read the paper by Carina Mood “Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It”  European Sociological Review , FEBRUARY 2010, Vol. 26, No. 1 , pp. 67-82.  She wrote a terrifically clear close look into the problem and describes the solutions that I mention above.  At the time she wrote the paper in 2010, there were not yet software options in STATA, SAS, and R for calculating the Average Marginal Effects (i.e. MARGINS commands), but now that there is, it seems to me that we should really start using them more than the OR in order to produce results with better interpretability.  I know I will.

 

Posted in Uncategorized | Leave a comment

Towards empirical methods for studying intersectionality

Towards empirical methods for studying intersectionality in research

The link above contains slides I put together in Jan-March 2021 for a talk before two audiences of Research fellows at Columbia University (the Psychiatric Epidemiology training program and the LGBT health training program).   

This topic is a work in progress.  Please feel free to share any comments with me mmwall@columbia.edu.

 

Posted in Uncategorized | Leave a comment

Data science methods for better understanding intersectionality

In addition to reconsidering how we operationalize racialized categories in our statistical models, to move towards a more anti-racist research agenda we must also consider intersectionality (ethnicity, gender, age, sexual orientation, social economic class, disability).   I am grateful for this terrifically written article by Greta Bauer from 2014 that I only wish I would have found sooner…Incorporating intersectionality theory into population health research methodology: Challenges and the potential to advance health equity that clearly describes both theory and methods. https://www.sciencedirect.com/science/article/pii/S0277953614001919

Of high relevance to data scientists doing the modeling work  see section “4.5 Understanding differences between types of regression models for intersectional applications” Dr. Bauer reviews why we should be estimating interaction on the “additive-scale” rather than the “multiplicative-scale” and encourages researchers to use related measures like RERI, or the synergy index.  Data scientists who are participating in meaningful health research should become aware and stop just testing cross-product terms in logistic regression when trying to test if “risks interact”! Link to a presentation on testing interactions

Posted in Uncategorized | Leave a comment

Refocusing what we mean by race as a variable used in statistical analysis

In meeting today a colleague said “It’s interesting that there was a time in the not too distant past when data analysis of outcomes by race was discourage and if it were to be done needed to have a strong justification for why that was necessary, and now it seems like we have done a 180 and want to emphasize it”

With the growing discourse/reckoning that is happening thanks to the Black Lives Matter movement and the urgent need for white people in the U.S. to embrace anti-racist actions, I as a statistician have some areas close to home I can start focusing on.  I have analyzed LOTS of data about human public and mental health in my career. Race/ethnicity almost always is included as a covariate and sometimes as an effect modifier (interaction by race).

The issue of importance about race as a variable for study in statistical analysis is about what we mean by racial categories.  Those who have in the past tried to caution against looking at race are (probably) worried about us slipping into our eugenics past when race was thought of as a biological characteristic of someone.    But what we need to do is to clarify that race is not only an individual level characteristic (and certainly not a biological characteristic) of a person but instead a characteristic of a person’s level of exposure to a racist and white-supremest society and environment.  The way we use race in analysis should not be thought of as a variable that describes the individual themself, but instead describes their collective experience of being identified and treated by the society in a certain way.

When we use race without careful explanation in our statistical analysis, readers are left to think of it in any way they choose, which may allow for the perpetuation of racist interpretations (e.g. for example black-white differences being attributed to individual or cultural level inadequacies).  I think it is our job as communicators of data and interpreters of findings to help lead the reader to interpret race from its institutionalized lens as a measure of the racist context a person lives within.  One simple suggestion is to to start adding footnotes to race variable whenever it is included in a statistical table.  Perhaps the footnote would say: “Race/ethnicity categories delineate differences of experiences in the person’s lifelong exposure to a racist society/institution”. Something like that, suggestions/improvements welcome.

We must do better.
Posted in Uncategorized | 1 Comment

From colleagues at New York State Office of *Mental Health*

The outbreak of COVID-19 around the world has led to the spread of fear and panic for individuals and communities. In addition to following physical precautions guidelines, individuals should be taking care of their psychological well-being.

This guide includes tips for the following populations:

  • For Everyone
  • For Individuals Receiving Mental Health Services
  • For Parents, Including Parents of Children with Pre-Existing Anxiety Disorders
  • For Caregivers of Elderly Individuals
  • For Mental Health Providers

For Everyone: 

  • Reduce anxiety by reducing risk. Ways to reduce risk include practicing good hygiene (e.g. sneezing and coughing into your elbow, sneezing into a tissue and immediately throwing the tissue away, wash hands regularly with soap and water for at least 20 seconds, etc.). In addition, create a plan in case your regular routine is disrupted, such as setting up remote work and alternative childcare arrangements. Setting out a plan can help reduce anxiety by taking charge of the things you can control.

 

  • Manage your information flow by choosing reliable sources and establish boundaries on checking for updates. Getting regular, factual information is important. However, continuously scrolling through social media or constantly refreshing the news is likely to lead to increased anxiety. Pick a few trusted news outlets – such as the state and local health authoritiesCenters for Disease Control and Prevention, or World Health Organization – and commit to checking once or twice a day for updates.

 

  • Monitor your anxiety levels. Anxiety is a normal response to a stressful situation and can provide adaptive benefits in many situations. However, when faced with mounting uncertainty, your brain can go into an anxiety spiral that is no longer helpful. Knowing the difference between typical and atypical stress is important. Monitoring your stress level will let you know when you need to seek additional help.
    • A typical stress reaction may include: temporary difficulty concentrating; irritability and anger; fatigue; stomachache; and, difficulty sleeping.
    • An atypical stress reaction may include a persistent and/or excessive worry that doesn’t lift and keeps you from carrying out your daily tasks. If you experience significant changes in your energy level, eating patterns, or sleeping patterns, difficulty concentrating on normal tasks, prolonged and overwhelming worry and hopelessness, or thoughts of self-injury or suicide, seek out immediate help at 1-800-273-TALK (8255) or text Got5 to 741741.

 

  • Practice good self-care, including exercise, eating healthy foods, and sleeping an adequate amount at night. If possible, spend some time outside. Avoid staying up late to monitor the news.

 

  • Virtually reach out to different types of support networks, such as family, friends, colleagues, faith-based communities, and social organizations to strengthen your overall feeling of connection. Isolation and loneliness feeds anxiety.

 

  • Find meaningful tasks and roles within your support network to channel your anxiety, such as coordinating deliveries of groceries to those unable to leave home, curating kids’ activity ideas for parents working from home, or video calling or calling those who might feel socially isolated. Supporting others is beneficial to the supporter as well.

 

  • Find or create spaces that are not focused on COVID-19. Start a social media thread about other topics, ask friends to discuss other topics, or watch your favorite TV or movie.

 

  • Savor small positive moments, amplify positive stories, and stay optimistic. Try to cultivate a mental wellness practice, such as writing in a gratitude journal, or talking nightly with your family about moments during the day that were fun or enjoyable.

 

  • Take an opportunity to practice mindfulness when managing anxiety. Mindfulness tools like grounding exercises, sensory modulation, and deep breathing may be helpful.

 

For Individuals Receiving Mental Health Services: 

  • As soon as possible, work with your mental health provider on a coping plan. Think about helpful coping skills you can practice daily and be mindful to those coping skills that you may turn to that are otherwise harmful to your safety and well-being. For example, if you know that music, walking outside, reframing your thoughts, and connecting with others are helpful, think about ways you can incorporate those into your daily life. If you know that you might struggle with ruminating, self-injury, substance use, or other strategies that might be harmful to your safety and well-being, identify alternative coping methods with your provider. Write out a plan to help prepare you for heightened anxiety.

 

  • Work with your mental health providers on specifically managing anxiety and ask them to help you come up with practical skills that you can rehearse.

 

  • Work with your mental health providers on alternative options if your routine services are disrupted. These might include usingtelemental health services, getting prescription medication, or engaging in supplemental mental wellness activities.

 

  • Seek positive peer support. Connect yourself to others who understand your experiences and can assist in problem-solving. If social distancing increases feelings of isolation, look into online peer supports or peer hotlines.

 

For Parents, Including Parents of Children with Pre-Existing Anxiety Disorders: 

  • Think about and rehearse scripts for talking with your kids about COVID-19. Kids take cues from caregivers about how anxious they need to be about a topic. Seek out resources and media to assist in your preparation.

 

  • Talk about the situation openly. Most kids elementary-aged and up have heard about COVID-19 or coronavirus. Avoiding the topic or providing blanket reassurances is more likely to feed anxiety. If kids bring up the topic, let them know you are glad they brought it up. This increased the likelihood that they will come to you with further anxieties or questions.

 

  • Don’t give more information than is requested. Part of a developmentally appropriate approach is to answer the question your child asks, but not necessarily more than that. Check to make sure they understood your response by asking them to repeat back what they heard, and let them know you are open to more questions. Reassure your child that it is normal to feel scared or anxious.

 

  • Help your school-aged child and adolescent set boundaries on their information flow in the same way you are setting your own boundaries. Help them identify factual sources of information and set appropriate intervals to check in. Encourage them to use their media literacy skills to question the messages they are getting from various information channels. Consider limiting media exposure or consuming media with your child so that you can be available to interpret and explain information.

 

  • Keep as many routines intact as possible. For kids who may be out of school and/or have extra-curricular activities cancelled, it is helpful to keep other routines, like mealtimes and bedtimes. To the extent possible, for kids who are at home for longer periods of time, set up a structure. Collaborate with your child to come up with a loose schedule, such as an outdoor activity and lunch prep in the morning, and a movie and homework time in the afternoon.

 

  • Find fun ways to maintain contact with individuals your child is separated from, such as elderly grandparents or classmates at school. Set up opportunities to maintain and even grow connections, such as reading a book to grandparents on video call or sending postcards to friends.

 

  • Encourage physical activity and time outside, where possible. Both staying active and having opportunities to be in nature are helpful with mitigating anxiety and building resilience.

 

  • Use this as an opportunity to teach distress tolerance skills that will be helpful to your kids in any situation. This is a great time to learn about purposeful breathing, guided imagery, distraction, and other skills.

 

For Caregivers of Elderly Individuals: 

  • Facilitate ways for the individual to maintain social connections. As the elderly have been told to isolate as much as possible, it is likely that social isolation and loneliness may take a toll on physical and mental health. Set up and provide technological assistance for family and friends to stay connected to the individual. Consider coordinating a group of people to check in on a rotation so that the individual feels the support of a network.

 

  • Encourage the individual to stay as active as possible, for both their physical and psychological well-being.

 

  • Help the individual find ways where they can help others, such as calling others to check in on them or entertaining grandchildren on FaceTime. Having a purpose and role can reduce anxiety.

 

  • Consider practical ways you can relieve an individual’s anxiety, such as volunteering to order their groceries online or offering to walk the individual’s dog(s).

 

  • In a time of high anxiety, it may be hard for the individual to select reliable sources to get information and updates on COVID-19. Curate a list of reputable media and write them down for the individual.

 

  • Practice self-care and be compassionate to yourself. While caregiving is a demanding and rewarding role at the best of times, being a caregiver during a time of heightened concerned is particularly stressful. If possible, find a way to take small breaks, rotate responsibilities with others, and practice your own mental health strategies.

 

For Mental Health Providers: 

  • Place a priority on self-care, including getting adequate rest and exercise, eating healthy food, maintaining social connections, and taking time away from service provision as possible.

 

  • Prepare for heightened anxiety in the individuals in your care and prepare your own toolkit on skills and scripts that might be helpful.

 

  • Work with your colleagues to prepare back-up plans for crisis management, such as telemental health or alternate therapeutic arrangements, so that you are prepared if there is a disruption in services. Work with your supervisor and colleagues to rotate functions and cross-train as much as possible

 

  • Set up peer supports, such as peer supervision and consultation, to connect with others who are in a similar situation. Setting up spaces to discuss the toll of vicarious trauma and anxiety is an important part of self-care.

 

  • Seek out professional help as needed. Remember that provision of mental health care during a crisis is challenging and it is critical that you address your own stress and anxiety.
Posted in Uncategorized | Leave a comment

For the Columbia Biostatistics department newsletter…

Mailman School of Public Health at Columbia University asked to “interview” me for a profile about biostatistics faculty.  Here’s what I said…

Tell us about yourself, your background:  I am originally from St. Louis and a first-generation college student.  My dad was an apartment building maintenance man and my mom stayed at home to take care of my two siblings and me while working part-time telephone marketing jobs.  I went to a public school in a blue collar suburb that was part of the desegregation intervention of the 1980’s to mix black children from the city into predominantly white schools in the suburbs.  I mention this because I think it left an important imprint on me about the many values of diversity in education and in society more broadly.   A high school guidance counselor encouraged and helped me to apply for scholarships and a Pell grant for college. With that, and also with part time jobs at McDonald’s and a paper factory, I paid my way through college at Truman State University in Kirksville Missouri to get a BS in mathematics with a minor in physics.  My college advisor encouraged me to go to graduate school, suggesting I go into math at Iowa State University.

I was accepted into the math program at Iowa State with a teaching assistantship, but after my first semester I realized I was much more interested in statistics.  I transferred to the statistics department, where I was the sole instructor for an undergrad statistics class each semester for the next 4 years.  I finished my PhD in 1998 with a dissertation titled “On nonlinear structural equation modeling” working with Yasuo Amemiya.

I then took a position as an assistant professor of biostatistics in the school of public health at the University of Minnesota.  There I found the fast-paced world of NIH research grants and collaborations with epidemiologist, health policy researchers, and medical doctors to be very stimulating.  I collaborated on research projects examining predictors of obesity and substance abuse using surveys and administrative databases.  I was the statistical reviewer on many NIH review panels and also started a graduate course on latent variable and structural equation modeling for health sciences which was new for the department of biostatistics.  I was eventually promoted to full professor with tenure.

When did you come to Columbia?  In 2010 I joined to the Columbia University psychiatry department and the New York State Psychiatric Institute (NYSPI) which has a very large and active research portfolio with its own dedicated biostatistics group.  Given my expertise in latent variable models and psychometrics and my enthusiasm for applied collaborations, it was a very good fit.  In 2012, I was appointed the director of the biostatistics division in the Department of Psychiatry/NYSPI and over the last 7 years, I have grown the division to include 14 full-time biostatisticians.

What are the some of the research areas in psychiatry you collaborate on?   The division of biostatistics in psychiatry participates in extensive statistical collaborations with over 50 psychiatry researchers in any given year, providing statistical expertise and analysis to projects, including: biomarkers from brain imaging and neurocognitive tasks for mood and anxiety disorders and psychosis; clinical trials of new pharmacotherapies and psychotherapies for psychiatric and neurological disorders, including building treatment decision rules; implementation studies of support service programs for mental health treatment and prevention; measurement studies for improving psychometrics of diagnosis instruments in substance use disorders, depression, and biological aging; cohort studies of child-adolescent development of psychiatric and substance use disorders; causal analysis of prescribing practices monitored from medical claims records; momentary assessment studies of cardiovascular response to emotions and stress markers related to suicide; and much more.

Why change the name of the division to Mental Health Data Science?  In 2018, I decided to rename the biostatistics group in psychiatry to be called the division of Mental Health Data Science (https://www.columbiapsychiatry.org/mental-health-data-science).  Our new name really represented a “re-branding” which was meant to exemplify our mission to embrace new and emerging data collection and processing technologies (e.g. multimodal neuroimaging, wearable devices, and electronic health systems) and to reflect our expertise in providing and developing analytic tools for making sense of these complex data domains.

What do you enjoy doing in your time outside work? I have a 10 year old kid and as much as possible we try to explore the city, go to plays, events, and be outside in the parks including taking trips to the ocean.  Having spent all of my life before 2010 in the Midwest, I enjoy taking opportunities to go to the ocean which is now just a car ride away.

 

 

 

Posted in Uncategorized | Leave a comment

Survival analysis always makes me nervous

This post is heavy on questions, and not too many answers.  Whenever I’m faced with a data analysis that seems to warrant using a survival analysis model, I get a bit nervous.   There is something about modeling time-to-event data that feels prone to mistakes and it is probably because each modeling question seems to beget other fundamental questions.  Maybe because I have less experience with performing survival analysis than other types of  modeling problems, I have not built up a strong intuition regarding the relative importance of the different questions.  Here is a list of things I worry about:

  1. How should we deal with censoring? What do we even mean by censoring – is it left or right or interval? Is the censoring informative?  If so, how to deal with that?
  2. How should we deal with time? What do we even mean by time – is it time-on-study, time since birth? What if the predictors are changing over time too, does that change how we should think of time?
  3. How should we deal with the event of interest itself? What do we mean by the event itself – sometimes other events get in the way of our event, are they worse? irrelevant? important? how to deal with these so-called “competing risks”?
  4. Should we worry about modeling assumptions like proportional hazards or does it not really matter?

All of these questions have come up for me, and might inspire future posts, but today I want to talk about #2.  While working on two recent analyses of observational studies  looking at predictors of  mortality, the question #2 came up.  Both studies are cohort studies with wide age ranges at baseline.  Should we control for age or should we treat age as the event time?  Turns out this does not have a clear answer.

I found this 2009 paper in Statistics in Medicine by Chalis, Chicken, and McGee that examines this “Time Scales in Epidemiological Analysis. An Empirical Comparison” https://pdfs.semanticscholar.org/8f6e/77a32653bd82b355311a47ccde43c1360d66.pdf.  They compare three models: 1) time on study 2) age as time scale and 3) left truncated age time scale and showed that the time-on-study time scale and the age time scale models can result in significantly different coefficients.  This paper does not provide guidance on whether one is more “right” than the other.  They say in the conclusion: “

“The selection of the time scale in the proportional hazards model and the measurement of its appropriateness is not straightforward. Moreover, we note that the definition of a “good time scale” is obscure. One possible definition for a “good time scale” could be one that preserves most of the useful information contained in the available measure and captures the nature of the data that is being modeled. Therefore, there is still a need of further investigation to find out adequate time scale that can best explain the disease-risk association.”

This paper is from 2009 so my investigation will continue…maybe there is is clarity somewhere in the literature in the 10 years since then…if I figure it out, I’ll let you know.

Posted in Uncategorized | 1 Comment

Clinical Trials: Checking out what the FDA and EMA recommend about controlling for covariates

At work I recently finalized analyses for a clinical trial that had 4 arms (double blind with 3 dose levels and a placebo) testing a nutritional supplement which may be helpful for cognition.  There were measurements at baseline (before randomization) and 3 followup times with the 2nd followup being the primary endpoint of interest.  I recommended, and we performed, a mixed effects model of change scores from baseline also controlling for the baseline outcome (as in ANCOVA), and also controlling for gender, age, and education all known to be predictors of the primary cognition outcome.

I don’t analyze nearly as many clinical trials as I do observational studies.  I was curious whether the analyses I’ve done (in particular the control or not for baseline outcomes and other covariates) would pass the check boxes for an FDA or EMA submission, so I looked it up.

Apparently just this past April 2019, FDA put out guidance for how to deal with covariates in clinical trials.  This guidance document is still open for public feedback and apparently is not meant to be the legally enforceable rules but just the general principles. It included just 5 points, much shorter than I expected. Here are there 5 points and my reactions (in bold):

  1. Sponsors can use ANCOVA to adjust for differences between treatment groups in relevant baseline variables to improve the power of significance tests and the precision of estimates of treatment effect. (looks like controlling for baseline factors is basically recommended but without specific guidelines how to choose them.  I would say include ones that are predictive of outcomes- because they will help with precision- and include ones that show non-trivial imbalance at baseline- since they will help with bias).
  2. Sponsors should not use ANCOVA to adjust for variables that might be affected by treatment. (Basically do not control for anything you collect AFTER randomization)
  3. The sponsor should prospectively specify the covariates and the mathematical form of the model in the protocol or statistical analysis plan. When these specifications are unambiguous, FDA will not generally be concerned about the sensitivity of results to the choice of covariates because differences between adjusted estimators and unadjusted estimators of the same parameter, or between adjusted estimators using different models, are random. (Saying they are unconcerned is a bit strange, but I guess they mean as long as they are prespecified since they are assuming randomization will take care of everything being balanced, this has not always been my experience especially in smaller <50 per arm studies)
  4. Interaction of the treatment with covariates is important, but the presence of an interaction does not invalidate ANCOVA as a method of estimating and testing for an overall treatment effect, even if the interaction is not accounted for in the model. The prespecified primary model can include interaction terms if appropriate. However, interaction means that the treatment effect is different for different subjects, and this fact could be relevant to prescribers, patients, and other stakeholders. Therefore, even though a primary analysis showing an overall treatment effect remains valid, differential effects in subgroups can also be important. (Green light to look at subgroup analysis – but remember ATE average treatment effect is still meaningful)
  5. Many clinical trials use a change from baseline as the primary outcome measure. Even when the outcome is measured as a change from baseline, the baseline value can still be used advantageously as a covariates. (These 3 things need to be said – and learned – over and over again, 1) ANCOVA with change score as the outcome gives the exact same test for treatment as ANCOVA with post-test as the outcome 2) Analyzing the change score as the outcome without controlling for the baseline gives a different treatment effect estimate than ANCOVA whenever there is any imbalance in the baseline outcome by treatment (i.e. whenever the treatment assignment is correlated by chance with baseline outcome) 3) Many articles recommend always using ANCOVA in clinical trials as it is most likely to reduce bias due to imbalance and also will have more precision)

 

The EMA also has guidelines from 2015 and in my opinion they are much better, perhaps because they are finalized and are not still in the open public feedback stage, although I was surprised FDA didn’t already have something like this.  EMA has 15 guidelines following a very reasonable preamble.  My comments in bold:

Preamble from the EMA: Baseline covariates impact the outcome in many clinical trials. Although baseline adjustment is not always necessary, in case of a strong or moderate association between a baseline covariate(s) and the primary outcome measure, adjustment for such covariate(s) generally improves the efficiency of the analysis and avoids conditional bias from chance covariate imbalance.

Baseline covariates may be accounted for at the design stage of a clinical trial and/or in the statistical analysis. When dealing with baseline covariates the following recommendations are made:

  1. Stratification may be used to ensure balance of treatments across covariates; it may also be used for administrative reasons (e.g. block in the case of block randomisation). The factors that are the basis of stratification should normally be included as covariates or stratification variables in the primary outcome model, except where stratification was done purely for an administrative reason. (Recognizing the stratification variables are covariates is very sensible)
  2. Variables known a priori to be strongly, or at least moderately, associated with the primary outcome and/or variables for which there is a strong clinical rationale for such an association should also be considered as covariates in the primary analysis. The variables selected on this basis should be pre-specified in the protocol. (Same point made in the FDA list)
  3. Baseline imbalance observed post hoc should not be considered an appropriate reason for including a variable as a covariate in the primary analysis. However, conducting exploratory analyses including such variables when large baseline imbalances are observed might be helpful to assess the robustness of the primary analysis. (If there is a large imbalance on a baseline variable and it is also a strong predictor of the outcome, it seems to me it must be included as a covariate to avoid bias.  However, I guess if it is a strong predictor of the outcome you should have already known that and included it in #2)
  4. Variables measured after randomisation and so potentially affected by the treatment should not be included as covariates in the primary analysis. (Same point made in the FDA list)
  5. If a baseline value of a continuous primary outcome measure is available, then this should usually be included as a covariate. This applies whether the primary outcome variable is defined as the ‘raw outcome’ or as the ‘change from baseline’. (Yes, they are saying…use ANCOVA)
  6. Covariates to be included in the primary analysis must be pre-specified in the protocol. (This is the same thing they said above in #2)
  7. Only a few covariates should be included in a primary analysis. Although larger data sets may support more covariates than smaller ones, justification for including each of the covariates should be provided. (It would be helpful if this had more specifics…what does ‘a few’ mean and perhaps some comment about whether there are any concerns about misspecification)
  8. In the absence of prior knowledge, a simple functional form (usually either linearity or categorising a continuous scale) should be assumed for the relationship between a continuous covariate and the outcome variable. (good point)
  9. The validity of model assumptions must be checked when assessing the results. This is particularly important for generalised linear or non-linear models where mis-specification could lead to incorrect estimates of the treatment effect. Even under ordinary linear models, some attention should be paid to the possible influence of extreme outlying values. (Check for outliers, always)
  10. Whenever adjusted analyses are presented, results of the treatment effect in subgroups formed by the covariates (appropriately categorised, if relevant) should be presented to enable an assessment of the model assumptions. (I have never done this but I think it is a very useful idea and worth trying though may need to be careful with smaller sample sizes or groups)
  11. Sensitivity analyses should be pre-planned and presented to investigate the robustness of the primary analysis. Discrepancies should be discussed and explained. In the presence of important differences that cannot be logically explained – for example, between the results of adjusted and unadjusted analyses – the interpretation of the trial could be seriously affected. (I like the emphasis on looking at the data in different ways.)
  12. The primary model should not include treatment by covariate interactions. If substantial interactions are expected a priori, the trial should be designed to allow separate estimates of the treatment effects in specific subgroups (In contrast to the FDA, this is encouraging researchers to stay away from subgroup analyses, remember treatment by covariate interactions is how we test for treatment effects in subgroups)
  13. Exploratory analyses may be carried out to improve the understanding of covariates not included in the primary analysis, and to help the sponsor with the ongoing development of the drug. (Keep learning from the data)
  14. In case of missing values in baseline covariates the principles for dealing with missing values as outlined e.g. in the Guideline on missing data in confirmatory clinical trials (EMA/CPMP/EWP/1776/99 Rev. 1) applies. (Need to go read this and put up a blog post for it)
  15. A primary analysis, unambiguously pre-specified in the protocol, correctly carried out and interpreted, should support the conclusions which are drawn from the trial. Since there may be a number of alternative valid analyses, results based on pre-specified analyses will carry most credibility. (Lay out your plan beforehand thoughtfully and follow it is a good recipe for credible and reproducible science) 
Posted in Uncategorized | 1 Comment

Confidence intervals for a ratio when you only have summary estimates and standard errors

 

This problem may arise if you have prevalence estimates and associated standard errors for some dichotomous outcome (e.g. cannabis use) in two exposure groups (e.g. people with and without a painful condition) and you want to get the confidence interval for the Relative Risk but you don’t have access to the raw data, only the summary estimates. Indeed, just such problem came up recently in a project I was working on.

For example,

Prevalence of cannabis use for people with pain is a=5.15% with standard error, s.e.(a)= 0.39%

Prevalence of cannabis use for people without pain is b=3.74% with standard error, s.e.(b)= 0.14%

We want the 95% confidence interval for relative ratio RR = a/b = 5.15/3.74 = 1.377

The method (formula) is attributed to Fieller, EC. (1940) “The biological standardisation of insulin”. Journal of the Royal Statistical Society (Supplement). 1:1–54. I took a look at the 1940 paper and there are many many formulas but after about 30 minutes I gave up trying to find which one is for the ratio and instead found a simple version of the specific formulation here https://i.stack.imgur.com/vO8Ip.png

Fieller’s method incorporates the value g below and can only be used when g<1 which occurs when the s.e.(B) is smaller than half of B (i.e. the denominator of the ratio). The formula will yield non-symmetric confidence intervals.

I checked the confidence interval estimates from my code below against estimates obtained from an online calculator at https://www.graphpad.com/quickcalcs/errorProp1/?Format=SEM and they are very very similar (to 3 decimal places)

a=c(5.15)   ###Prevlance of cannabis use for people with pain
sea=c(.39)  ###standard error of prevalence

b=c(3.74)   ###Prevalence of cannabis use for people without pain
seb = c(.14) ####standard error of prevalence

q = a/b

### using simple delta method gives similar results in our case to the Fiellers method since seb is small compared to b
### here is the formulat but I leave it commented out since we might as well use the Fiellers method which is expected
### to be more accurate and can produce non-symmetric confidence intervals more appropriate for risk ratios
#seqdelta = q*sqrt(sea^2/a^2 + seb^2/b^2)
#q
#q-1.96*seqdelta
#q+1.96*seqdelta

###Fieller's formula for the standard error
### if g > 1 then do not use this formula it is not a good approximation

g=(1.96*seb/b)^2
if (g<1) {
q = a/b
serr = (q/(1-g))*sqrt((1-g)*sea^2/a^2 + seb^2/b^2)
}

This outputs the 95% confidence intervals for the risk ratio = 1.377

for RR = A/B = prevalence of cannabis use in the pain group / prevalence of cannabis use in the no pain group

a
## [1] 5.15
b
## [1] 3.74
###risk ratio
q
## [1] 1.377005
###left side of 95% confidence interval
q/(1-g) - 1.96*serr
## [1] 1.155729
###right side of 95% confidence interval
q/(1-g) + 1.96*serr
## [1] 1.613187

So the 95% confidence interval for RR=1.377 using Fieller’s method for approximating confidence intervals is (1.156, 1.613)

 

Posted in Uncategorized | Leave a comment