Machine learning for psychiatric research:  
Current reality and future prospects

The psychiatry department at Columbia University and the New York State Psychiatric Institute (of which I am a part) invited me to give a presentation associated with a new award set up by Dr. Susan Essock for Mental Health Services Research. I put a lot of effort into organizing how I see the role of machine learning within scientific research in psychiatry. It’s an exciting time to be a data scientist.

Posted in Uncategorized | Leave a comment

Overview of Neuroimaging data modalities

This year at the JSM in Vancouver I was asked to be a discussant in a session on statistical methods for high-dimensional neuroimaging markers for mental health disorders. Given the diversity of the four talks, rather than discuss the different statistical methods, I took the opportunity to pull together in one place a basic description of the different neuroimaging modalities that were used in the talks: EEG, MRI, fMRI, dMRI, and MRS.


Finally started learning the TIDYVERSE in R!!!

This past week, Monday to Friday for 2 hours every day, my colleague Jeff Goldsmith held a CRASH COURSE in the R TIDYVERSE for ~10 faculty in Mental Health Data Science and Biostatistics who are all (cough) over 30 years old (ok, maybe over 50). Some of us on the upper tail of that age distribution remember when R began in the 90s and was the next best thing after S+, but that is ancient history.

Jeff went through most of his semester course notes (http://www.p8105.com) for us. For posterity, I list here all the things I had no clue about before Monday and now think I will be using as much as possible, along with some analogies to SAS that helped me understand them better:

  1. RStudio (like the SAS editor, log and output window, but fancier)
  2. R Markdown documentation (maybe I will eventually relent and agree to use html with colleagues) and “knitting”
  3. Tibbles (the tidyverse’s modern version of data frames), read_csv (instead of read.csv, oh so easy), haven::read_sas
  4. How to manipulate datasets (“tibbles”) using the tidyverse (the name of the ‘revolutionary’ collection of R packages that has transformed data analysis pipelines in R and embraces the reproducible, transparent science movement) using: select (to choose variables), filter (to choose rows), mutate (to derive new variables), arrange (to sort). This set of tools is what I see as the major advance in R that now puts it at the same level as, or beyond, SAS for manipulating data (something I used to believe R was not at all good for, and probably the main limitation that kept me from using it)
  5. Using the pipe %>% rather than nested functions so that code is cleaner and more transparent
  6. Merging data using full_join() and left_join(). I was shocked when I found out we don’t have to sort the data first
  7. How to transpose from wide to long and vice versa using gather() and spread(). I found this as non-intuitive as SAS Proc Transpose, but not worse.
  8. Using broom::tidy() to create tibbles from model output. This is like ODS in SAS, where everything that comes out of a model can be used as a dataset, and thus #4 applies.
  9. GGPLOT: finally I can make sense of the code (I had seen it in the past with the + signs everywhere and couldn’t understand it). aes() sets up the overall aesthetics, including what will be on the x vs y axis and what to plot “by” (below, name is the grouping variable indicating which park the data is from). geom_point tells ggplot you want a scatterplot (alpha is the transparency of the points). geom_smooth is obvious, but aes(color = NULL) in this layer tells it to forget the original colors. facet_grid is how to get the three different plots rather than all of them overlaid.
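library(tidyverse)

# `weather` is assumed to be the data frame used in the course (tmin, tmax, and park name)
weather %>%
  ggplot(aes(x = tmin, y = tmax, color = name)) +
  geom_point(alpha = .2) +
  # aes(color = NULL) drops the inherited color mapping for this layer only
  geom_smooth(aes(color = NULL), method = "loess", color = "black") +
  facet_grid(. ~ name)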

  10. How to reorder factors using forcats::fct_reorder() (notice that forcats is an anagram of factors)

  11. How to use the group_by command to do data manipulation within groups and to pass the grouped data on to the summarize command. When you use group_by in defining a new tibble, the grouping “sticks” to the tibble as meta information until you “ungroup” it later.

  12. What Git and GitHub are, though I can’t say I have fully figured out how to set them up yet

  13. And a whole lot more that I’m sure I don’t remember but now know where to go to find out 🙂
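
To make the data manipulation items in the list above concrete, here is a minimal sketch on made-up data. The `visits` and `arms` tibbles and all of their column names are hypothetical (they are not from Jeff's course notes), and the sketch assumes a recent version of tidyr, where pivot_wider() and pivot_longer() are the newer counterparts of spread() and gather().

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per person per visit
visits <- tibble(
  id    = c(1, 1, 2, 2, 3, 3),
  visit = c("baseline", "followup", "baseline", "followup", "baseline", "followup"),
  score = c(20, 14, 31, 25, 18, 19)
)
# Hypothetical treatment assignments, deliberately NOT sorted by id
arms <- tibble(id = c(3, 1, 2), arm = c("placebo", "drug", "drug"))

# filter / mutate / select / arrange chained with the pipe
followup_only <- visits %>%
  filter(visit == "followup") %>%      # choose rows
  mutate(high = score >= 20) %>%       # derive a new variable
  select(id, score, high) %>%          # choose variables
  arrange(desc(score))                 # sort

# left_join matches on the key, so neither input needs to be sorted first
merged <- left_join(visits, arms, by = "id")

# Long to wide (the job spread() does), then a change score per person
wide <- merged %>%
  pivot_wider(names_from = visit, values_from = score) %>%
  mutate(change = followup - baseline)

# Wide back to long (the job gather() does)
long_again <- wide %>%
  select(-change) %>%
  pivot_longer(c(baseline, followup), names_to = "visit", values_to = "score")

# group_by "sticks" to the tibble until ungroup() is called
grouped <- group_by(wide, arm)
group_vars(grouped)                  # the grouping variable travels with the tibble
is_grouped_df(ungroup(grouped))      # FALSE once ungrouped

# summarize then works within each group
arm_means <- grouped %>%
  summarize(mean_change = mean(change))
```

Printing `wide` and `arm_means` after running this is a quick way to see what each step did. The SAS analogy would be a merge without the preceding PROC SORT, followed by PROC TRANSPOSE and PROC MEANS with a BY statement.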

My goal is to start making posts using R Markdown soon 🙂


Symposium for Big Brain Data at NYSPI

NYSPI held the first Symposium on Big Brain Data on May 1, 2018. It was very well attended (over 40 local researchers), and the consensus was that we need to keep up the momentum of collaborating across groups and work towards a centralized data storage, processing, and innovation core for the efficient and reproducible advancement of psychiatric research using MRI and other high-dimensional big data technologies.

 

List of speakers


Data Science premieres at Elementary School Science Expo

Over the weekend some biostatistics and statistics colleagues of mine at Columbia University (Ying Wei and Ming Yuan) and LSE (Irini Moustaki) took on the challenge of making Data Science fun for elementary school children (K-6). Our kids’ school (The School at Columbia) hosts a Science Expo every other year where they invite mostly parent scientists to create exhibit rooms (in classrooms throughout the school), and the school becomes a science museum for a day.

We had:

1) Games of chance with a dice game and a Galton board

2) Data visualization https://www.showmeshiny.com

3) Chernoff faces as a descriptive tool for summarizing/clustering similar multivariate data. We collected survey data on questions kids developed (e.g. what subject do you like in school, do you prefer white or chocolate milk) and then summarized and clustered it.

4) A Mock Clinical Trial.  The results of the trial confirmed that having kids read about eating healthy fruits does not make them any less likely to choose candy!

 


p-hacking and the danger zone of knowing just enough statistics to be dangerous

The danger zone of data science is the intersection of content knowledge and computer skills good enough to run statistical “data science” software, but without the statistical knowledge and principles needed to know when to stop hunting.

https://www.the-scientist.com/?articles.view/articleNo/51920/title/Emails-Reveal-Questionable-Practices-by-Cornell-Food-Scientist-and-His-Coauthors/
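
To show why “hunting” is dangerous, here is a minimal simulation sketch. It is my own illustration and not taken from the linked article: the number of outcomes, the group size, and the number of simulated studies are all assumed values. Every outcome is pure noise, so any “significant” result is a false positive.

```r
set.seed(2018)

n_outcomes  <- 20     # assumed number of outcomes someone hunts through
n_per_group <- 30     # assumed sample size per group
n_sims      <- 1000   # number of simulated null studies
alpha       <- 0.05

# Each column is one simulated study: 20 two-sample t-tests where the two
# groups are drawn from the SAME distribution (no true effect anywhere)
p_values <- replicate(
  n_sims,
  replicate(n_outcomes, t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value)
)

# The hunter reports a finding if the smallest p-value clears the threshold
smallest_p <- apply(p_values, 2, min)

# Chance of at least one false positive per study, without any adjustment
fwer_unadjusted <- mean(smallest_p < alpha)

# Same studies, but with a Bonferroni-adjusted threshold
fwer_bonferroni <- mean(smallest_p < alpha / n_outcomes)

# What theory says for 20 independent tests: 1 - 0.95^20, about 0.64
analytic <- 1 - (1 - alpha)^n_outcomes

round(c(unadjusted = fwer_unadjusted,
        bonferroni = fwer_bonferroni,
        analytic   = analytic), 3)
```

With 20 independent looks at noise, the chance of at least one p < 0.05 is about 64 percent, which is why knowing when to stop (or adjusting for the number of looks) matters.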

 

 

 


Deep learning for correlates of psychiatric disorders – can the algorithm do better than a human

Today I read the article “Using deep learning to investigate the neuroimaging correlates of psychiatric and neurological disorders: Methods and applications” by Vieira, Pinaya, and Mechelli in Neuroscience & Biobehavioral Reviews (https://www.sciencedirect.com/science/article/pii/S0149763416305176#tbl0005). I think it is a very well written overview and review of deep learning for neuroimaging data.
 
I was surprised to learn of the seemingly high level of accuracy that prediction models for mild cognitive impairment and Alzheimer’s disease (MCI/AD) vs healthy controls have been achieving using just structural imaging data. This is reviewed nicely in the article. But then I remembered that AD is at least partially diagnosed by a human visually reading a structural MRI. It would be useful to compare the machine learning algorithm with how accurate a well-trained human would be at diagnosing AD vs healthy control just from looking at an individual patient’s structural MRI. I want to get a sense of whether we are yet creating learning algorithms that are BETTER than humans at diagnosing AD, or just automating what a human could do.
 
In other types of imaging, e.g. mammography, if we look at the success so far of deep learning algorithms (https://www.ncbi.nlm.nih.gov/pubmed/28212138, “Deep Learning in Mammography: Diagnostic Accuracy of a Multipurpose Image Analysis Software in the Detection of Breast Cancer”), it appears they are only now reaching the level of “as good as a human”. This mammography article concludes: “Current state-of-the-art artificial neural networks for general image analysis are able to detect cancer in mammographies with similar accuracy to radiologists, even in a screening-like cohort with low breast cancer prevalence”. Similarly, in the corporate world, the great successes of deep learning so far are all in *automation*, i.e. getting a computer to do something just as well as a human can (e.g. speech recognition, reading an image to determine whether it is a Starbucks or a laundromat, or detecting spam emails).
 
At least one of the big hurdles with psychiatric disorders (I think) is that up to now even well-trained humans haven’t been able to look at a brain scan and label a person as having schizophrenia, ADHD, or OCD, or as healthy. Instead we have to talk to people about their symptoms and/or observe their behaviors. So while there seems to be some apparent success (the Neuroscience & Biobehavioral Reviews article mentions some success for ADHD and SZ, though I’m skeptical about replication), I think we are still a long way off from having deep learning algorithms do something that a human cannot do, which is to look at a brain scan and diagnose a psychiatric disorder. Nevertheless we will keep trying and will certainly learn a lot about the brain along the way.

Data Science in biostatistics academic curriculums

In the Department of Biostatistics at Columbia University, in the 2017-18 school year my colleagues Jeff Goldsmith and Yifei Sun have started a sequence on Data Science aimed at our Masters students. It has been very popular, and I think it is terrific that we are incorporating this sea change. I have encouraged biostatisticians who work in our Mental Health Data Science team to sit in on the courses.

COURSE DESCRIPTION – Data Science 1

Contemporary biostatistics and data analysis depends on the mastery of tools for computation, visualization, dissemination, and reproducibility in addition to proficiency in traditional statistical techniques. The goal of this course is to provide training in the elements of a complete pipeline for data analysis. It is targeted to MS, MPH, and PhD students with some data analysis experience.

Students who successfully complete this course will:

  • Integrate the principles of data organization into their analyses;
  • Easily produce static and interactive graphics;
  • Implement analyses in a reproducible way;
  • Use Github to publish and disseminate analyses;
  • Develop usable software packages in R;
  • Collect data from online sources using web-scraping.

RECOMMENDED REFERENCES (note: there are no required texts for this course)

The Internet (stackoverflow; google; blog posts; twitter)
R for Data Science by G. Grolemund and H. Wickham
Exploratory Data Analysis with R by R Peng
R Programming for Data Science by R Peng.
R Packages by H. Wickham
Advanced R by H. Wickham

COURSE DESCRIPTION – Data Science 2

With the explosion of “Big Data” problems, statistical learning has become a very hot field in many scientific areas. The goal of this course is to provide the training in practical statistical learning. It is targeted to MS students with some data analysis experience.

Students who successfully complete this course will be able to:

  • Understand basic concepts and methods in statistical learning
  • Apply classification and regression techniques beyond linear regression
  • Conduct exploratory data analysis using methods in unsupervised learning
  • Implement various statistical learning methods using software package

RECOMMENDED REFERENCES

[ISL] An Introduction to Statistical Learning with Applications in R (main reference) by G James et al.

[EDA] Exploratory Data Analysis with R by R Peng

R Programming for Data Science by R Peng

I also found the American Statistical Association publication describing what it thinks should be taught.

 


The misguided goal of figuring out if A causes B or the reverse

Just read a 2011 review essay, “Causality and Statistical Learning”, in the American Journal of Sociology by Andy Gelman (who, by the way, encouraged me several years ago to start my own blog). He reviewed three books on causality: by Morgan and Winship, by Pearl, and by Sloman. I am thinking of putting the review essay on the reading list for my course on Latent Variable and Structural Equation Modeling (SEM) for Health Science. Andy’s summary of the difficult problems of causal inference is very useful. I agree with the way he lays out an ordering of views/attitudes on causal reasoning from conservative to permissive, with most traditional users of SEM being on the permissive side. But what I really liked was the following critique he made of Sloman’s book, which reminds me how important it is to keep asking my colleagues: why do we even care if A is a cause of B, or which direction the cause runs? Of course, much of the time there are good reasons to care (i.e. a different intervention or policy might be implemented if we thought we knew something about the causal direction), but not always, and so it is a good question to ask in the hope of steering the analytic goals in a different direction.

p.963 “The place where I think Sloman is misguided is in his formulation of scientific models in an either/or way, as if, in truth, social variables are linked in simple causal paths, with a scientific goal of figuring out if A causes B or the reverse. I don’t know much about intelligence, beer consumption, and socioeconomic status, but I certainly don’t see any simple relationships between income, religious attendance, party identification, and voting—and I don’t see how a search for such a pattern will advance our understanding, at least given current techniques. I’d rather start with description and then go toward causality following the approach of economists and statisticians by thinking about potential interventions one at a time.”

Review Essay: Causality and Statistical Learning, by Andrew Gelman
Counterfactuals and Causal Inference: Methods and Principles for Social Research. By Stephen L. Morgan and Christopher Winship. New York: Cambridge University Press, 2007. Pp. xiii+319.
Causality: Models, Reasoning, and Inference, 2d ed. By Judea Pearl. Cambridge: Cambridge University Press, 2009. Pp. xix+464.
Causal Models: How People Think About the World and Its Alternatives. By Steven A. Sloman. Oxford: Oxford University Press, 2005. Pp. xi+212.
American Journal of Sociology 117: 955-966. 2011.

To explain or predict

Just discovered the work “To explain or predict” of Galit Shmueli (http://www.galitshmueli.com/content/explain-or-predict). She excellently clarifies the concepts of “explanatory modeling” and “predictive modeling” which mean different things to statisticians, social scientists, computer scientists and other scientists.

My own work as a biostatistician has mostly focused on explanatory modeling (i.e. trying to test whether some exposure affects some outcome and, if so, by how much and in what contexts, in order to explain the underlying mechanism). The primary focus of machine learning methods has been predictive modeling or classification modeling (i.e. trying to determine whether a user will click on an advertisement, or whether a particular image is a cat or a human).

When we also add in descriptive modeling, I think one useful way of describing Data Science is that it encompasses all the analytic tools (both quantitative and, as my colleague Rogerio Pinto would remind me, qualitative methods) for describing, explaining, and predicting phenomena.
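
As a rough illustration of the explanatory vs predictive distinction, here is a minimal sketch using R’s built-in mtcars data. It is my own toy example and not from Shmueli’s paper: the choice of variables and the size of the hold-out set are assumptions. The same linear model is used both ways, and what changes is the question asked of it.

```r
set.seed(1)
data(mtcars)

# Explanatory use: fit on all the data and interpret a coefficient.
# Question: holding horsepower fixed, how does weight relate to mpg,
# and how uncertain is that estimate?
fit_explain <- lm(mpg ~ wt + hp, data = mtcars)
wt_effect   <- coef(fit_explain)["wt"]
wt_ci       <- confint(fit_explain)["wt", ]

# Predictive use: hold out some cars, fit on the rest, and judge the
# model only by how well it predicts the cars it has never seen.
test_rows   <- sample(nrow(mtcars), 8)   # assumed hold-out size
train       <- mtcars[-test_rows, ]
test        <- mtcars[test_rows, ]
fit_predict <- lm(mpg ~ wt + hp, data = train)
predicted   <- predict(fit_predict, newdata = test)
rmse        <- sqrt(mean((test$mpg - predicted)^2))

wt_effect   # the object of interest for explanation
wt_ci
rmse        # the object of interest for prediction
```

In the explanatory use the coefficient and its interval are the result. In the predictive use the coefficients are incidental and the out-of-sample error is the result, so a model with less interpretable terms might be preferred if it lowers that error.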

“To explain or predict” https://www.youtube.com/watch?v=vWH_HNfQVRI
