The psychiatry department at Columbia University and the New York State Psychiatric Institute (which I am a part of) invited me to give a presentation associated with a new award set up by Dr. Susan Essock for Mental Health Services Research. I put a lot of effort into trying to organize how I see the role of machine learning within scientific research in psychiatry. It’s an exciting time to be a data scientist.
Machine learning for psychiatric research: Current reality and future prospects
Overview of neuroimaging data modalities
This year at the JSM in Vancouver I was asked to be a discussant in a session on statistical methods for high-dimensional neuroimaging markers of mental health disorders. Given the diversity of the four talks, rather than discuss the different statistical methods, I took the opportunity to pull together in one place a basic description of the different neuroimaging modalities used in the talks: EEG, MRI, fMRI, dMRI, and MRS.
Finally started learning the TIDYVERSE in R!!!
This past week, Monday through Friday for two hours every day, my colleague Jeff Goldsmith held a CRASH COURSE in the R TIDYVERSE for ~10 faculty in Mental Health Data Science and Biostatistics who are all (cough) over 30 years old (ok, maybe over 50). Some of us on the upper tail of that age distribution remember when R began in the 1990s and was the next big thing after S+, but that is ancient history.
Jeff went through most of his semester course notes (http://www.p8105.com) for us. Just for posterity, I list here all the things I had no clue about before Monday and now think I will be using as much as possible, along with some analogies to SAS that helped me understand them better:
- RStudio (like the SAS editor, log and output window, but fancier)
- R Markdown documentation (maybe I will eventually relent and agree to use html with colleagues) and “knitting”
- Tibbles (the tidyverse’s modern take on data frames), read_csv() (instead of read.csv, oh so easy), and haven::read_sas()
- How to manipulate datasets (“tibbles”) using the tidyverse (the name of the ‘revolutionary’ R package that has transformed data analysis pipelines in R and embraces the reproducible, transparent science movement): select() to choose variables, filter() to choose rows, mutate() to derive new variables, and arrange() to sort. This set of tools is what I see as the major advance that now puts R at the same level as SAS, or beyond it, for manipulating data (something I used to believe R was not at all good for, and probably the main limitation that kept me from using it)
- Using the pipe %>% rather than nested functions, so that code is cleaner and more transparent
- Merging data using full_join() and left_join(). I was shocked when I found out we don’t have to sort the data first
- How to transpose from wide to long and vice versa using gather() and spread(). I found this as non-intuitive as SAS PROC TRANSPOSE, but not worse.
- Using broom::tidy() to create tibbles from model output. This is like ODS in SAS, where everything that comes out of a model can be used as a dataset, so the data manipulation tools above apply.
- ggplot2: finally I can make sense of the code (I had seen it in the past with the + signs everywhere and couldn’t understand it). aes() sets up the overall aesthetics, including what goes on the x vs. y axes and what to plot “by” (below, name is the grouping variable indicating which park the data are from); geom_point() tells ggplot you want a scatterplot (alpha is the transparency of the points); geom_smooth() is obvious, but aes(color = NULL) in this layer tells it to forget the original colors; facet_grid() is how to get the three separate plots rather than all of them overlaid.
weather %>%
  ggplot(aes(x = tmin, y = tmax, color = name)) +
  geom_point(alpha = .2) +
  geom_smooth(aes(color = NULL), method = "loess", color = "black") +
  facet_grid(. ~ name)
- How to reorder factors using forcats::fct_reorder() (notice forcats is an anagram of factors)
- How to use the group_by command to do data manipulation within groups, and to feed into the summarize command. When you use group_by in defining a new tibble, it “sticks” to the tibble as meta information until you ungroup() it later.
- What Git and GitHub are, though I can’t say I have fully figured out how to set them up yet
- And a whole lot more that I’m sure I don’t remember but now know where to go to find out 🙂
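To tie several of these pieces together, here is a minimal sketch of a tidyverse pipeline using a made-up tibble (the data, variable names, and model below are purely illustrative, not from the course):

```r
library(tidyverse)
library(broom)

# Hypothetical data, invented for illustration
visits <- tibble(
  id    = c(1, 2, 3, 4),
  age   = c(34, 61, 47, 52),
  score = c(10, 25, 18, 22)
)
sites <- tibble(
  id   = c(3, 1, 4, 2),          # deliberately unsorted: no pre-sorting needed
  site = c("B", "A", "B", "A")
)

result <- visits %>%
  left_join(sites, by = "id") %>%  # merge without sorting first
  filter(age > 40) %>%             # choose rows
  mutate(high = score > 20) %>%    # derive a new variable
  arrange(desc(score)) %>%         # sort
  group_by(site) %>%               # "sticks" until ungroup()
  summarize(mean_score = mean(score), n = n())

result

# broom::tidy() turns model output into a tibble, like SAS ODS
fit <- lm(score ~ age, data = visits)
tidy(fit)
```

Each verb takes a tibble and returns a tibble, which is what makes the pipe read like a sequence of data steps.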
My goal is to start making posts soon using R Markdown 🙂
Symposium for Big Brain Data at NYSPI
NYSPI held the first Symposium on Big Brain Data on May 1, 2018. It was very well attended (over 40 local researchers), and the consensus was that we need to keep up the momentum of collaborating across groups and work towards a centralized data storage, processing, and innovation core for the efficient and reproducible advancement of psychiatric research using MRI and other high-dimensional big data technologies.
List of speakers
Data Science premieres at Elementary School Science Expo
Over the weekend some biostatistics and statistics colleagues of mine at Columbia University (Ying Wei and Ming Yuan) and LSE (Irini Moustaki) took on the challenge of making Data Science fun for Elementary School Children (K-6). Our kids’ school (The School at Columbia) hosts a Science Expo every other year where they invite mostly parent scientists to create exhibit rooms (in classrooms throughout the school) and the school becomes a science museum for a day.
We had:
1) Games of chance with a dice game and a Galton board
2) Data visualization https://www.showmeshiny.com
3) Chernoff faces as a descriptive tool for summarizing/clustering similar multivariate data. We collected survey data on questions the kids developed (e.g., what subject do you like in school? do you prefer white or chocolate milk?) and then summarized and clustered it.
4) A Mock Clinical Trial. The results of the trial confirmed that having kids read about eating healthy fruits does not make them any less likely to choose candy!
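The Galton board from the games-of-chance room is easy to mimic in a few lines of R. This is just a sketch of the idea, not the exhibit code: each ball bounces left or right at each peg, so its final bin follows a binomial distribution and the balls pile up in a bell shape.

```r
# Simulate a Galton board: each ball bounces left/right at each of 10 pegs
set.seed(1)
n_balls <- 5000
n_rows  <- 10

# A ball's final bin is the number of rightward bounces: Binomial(10, 0.5)
bins <- rbinom(n_balls, size = n_rows, prob = 0.5)

table(bins)             # counts pile up near the middle bin (5)
# barplot(table(bins))  # the familiar bell shape
```

The same simulation works for the dice game by swapping rbinom() for sample(1:6, ...).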
p-hacking and the danger zone of knowing just enough statistics to be dangerous
The danger zone of data science is the intersection of content knowledge with computer skills that are good enough to run statistical “data science” software, but without the statistical knowledge and principles to know when to stop hunting.
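A quick simulation makes the danger concrete (the numbers here are illustrative, not from any real study): if you hunt across 20 independent comparisons where nothing is actually going on, you will still “find” something significant most of the time.

```r
# p-hacking by simulation: test 20 null hypotheses per study and count
# how often at least one comes out "significant" at p < 0.05
set.seed(42)
n_sims  <- 1000   # simulated studies
n_tests <- 20     # hypotheses hunted per study

any_hit <- replicate(n_sims, {
  # 20 two-sample t-tests where the null is true in every one
  pvals <- replicate(n_tests, t.test(rnorm(30), rnorm(30))$p.value)
  any(pvals < 0.05)
})

mean(any_hit)   # close to 1 - 0.95^20, even though no effect exists
</test>
```

Theory says the false-alarm rate is 1 - 0.95^20, about 0.64, which the simulation reproduces.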
Deep learning for correlates of psychiatric disorders – can the algorithm do better than a human?
Data Science in biostatistics academic curriculums
In the Department of Biostatistics at Columbia University this 2017–18 school year, my colleagues Jeff Goldsmith and Yifei Sun have started a sequence on Data Science aimed at our Masters students. It has been very popular, and I think it is terrific that we are incorporating this sea change. I have encouraged biostatisticians who work on our Mental Health Data Science team to sit in on the courses.
COURSE DESCRIPTION – Data Science 1
Contemporary biostatistics and data analysis depend on the mastery of tools for computation, visualization, dissemination, and reproducibility in addition to proficiency in traditional statistical techniques. The goal of this course is to provide training in the elements of a complete pipeline for data analysis. It is targeted to MS, MPH, and PhD students with some data analysis experience.
Students who successfully complete this course will:
- Integrate the principles of data organization into their analyses;
- Easily produce static and interactive graphics;
- Implement analyses in a reproducible way;
- Use GitHub to publish and disseminate analyses;
- Develop usable software packages in R;
- Collect data from online sources using web-scraping.
RECOMMENDED REFERENCES (note: there are no required texts for this course)
The Internet (stackoverflow; google; blog posts; twitter)
R for Data Science by G. Grolemund and H. Wickham
Exploratory Data Analysis with R by R Peng
R Programming for Data Science by R Peng.
R Packages by H. Wickham
Advanced R by H. Wickham
COURSE DESCRIPTION – Data Science 2
With the explosion of “Big Data” problems, statistical learning has become a very hot field in many scientific areas. The goal of this course is to provide training in practical statistical learning. It is targeted to MS students with some data analysis experience.
Students who successfully complete this course will be able to:
- Understand basic concepts and methods in statistical learning
- Apply classification and regression techniques beyond linear regression
- Conduct exploratory data analysis using methods in unsupervised learning
- Implement various statistical learning methods using statistical software
RECOMMENDED REFERENCES
[ISL] An Introduction to Statistical Learning with Applications in R (main reference) by G James et al.
[EDA] Exploratory Data Analysis with R by R Peng
R Programming for Data Science by R Peng
I also found the American Statistical Association’s publication describing what they think should be taught.
The misguided goal of figuring out if A causes B or the reverse
I just read a review essay on Causality and Statistical Learning from 2011 in the American Journal of Sociology by Andy Gelman (who, by the way, encouraged me several years ago to start my own blog). He reviewed three books on causality: by Morgan and Winship, by Pearl, and by Sloman. I am thinking of adding the review essay to the reading list for my course on Latent Variable and Structural Equation Modeling (SEM) for Health Science. Andy’s summary of the difficult problems of causal inference is very useful. I agree with the way he lays out an ordering of views/attitudes on causal reasoning from conservative to permissive, with most traditional users of SEM being on the permissive side. But what I really liked was the following critique he made of Sloman’s book, which reminds me how important it is to keep asking my colleagues: why do we even care whether A is a cause of B, or which direction the cause runs? Of course, much of the time there are good reasons to care (i.e., a different intervention or policy might be implemented if we knew something about the causal direction), but not always, and so it is a good question to ask to hopefully steer the analytic goals in a different direction.
p.963 “The place where I think Sloman is misguided is in his formulation of scientific models in an either/or way, as if, in truth, social variables are linked in simple causal paths, with a scientific goal of figuring out if A causes B or the reverse. I don’t know much about intelligence, beer consumption, and socioeconomic status, but I certainly don’t see any simple relationships between income, religious attendance, party identification, and voting—and I don’t see how a search for such a pattern will advance our understanding, at least given current techniques. I’d rather start with description and then go toward causality following the approach of economists and statisticians by thinking about potential interventions one at a time.”
Review Essay: Causality and Statistical Learning, by Andrew Gelman
Counterfactuals and Causal Inference: Methods and Principles for Social Research. By Stephen L. Morgan and Christopher Winship. New York: Cambridge University Press, 2007. Pp. xiii+319.
Causality: Models, Reasoning, and Inference, 2d ed. By Judea Pearl. Cambridge: Cambridge University Press, 2009. Pp. xix+464.
Causal Models: How People Think About the World and Its Alternatives. By Steven A. Sloman. Oxford: Oxford University Press, 2005. Pp. xi+212.
American Journal of Sociology 117: 955–966, 2011.
To explain or predict
I just discovered the work “To Explain or to Predict?” by Galit Shmueli (http://www.galitshmueli.com/content/explain-or-predict). She does an excellent job clarifying the concepts of “explanatory modeling” and “predictive modeling,” which mean different things to statisticians, social scientists, computer scientists, and other scientists.
My own work as a biostatistician has mostly focused on explanatory modeling (i.e., trying to test whether some exposure affects some outcome and, if so, by how much and in what contexts, in order to explain the underlying mechanism). The primary focus of machine learning methods has been predictive or classification modeling (i.e., trying to determine whether a user will click on an advertisement, or whether a particular image shows a cat or a human).
When we also add in descriptive modeling, I think one useful way of describing Data Science is that it encompasses all the analytic tools (both quantitative and, as my colleague Rogerio Pinto would remind me, qualitative methods) for describing, explaining, and predicting phenomena.
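The explain/predict distinction is easy to see in code. Here is a minimal sketch on simulated data (the data-generating process and split are invented for illustration): the explanatory analysis asks what the coefficient and its uncertainty tell us about the mechanism, while the predictive analysis only asks how well the model does on data it has not seen.

```r
# Explanatory vs. predictive modeling on the same simulated data
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)      # true effect of x on y is 2
dat <- data.frame(x = x, y = y)

# Explanatory: fit on all data, interpret the estimate, SE, and p-value
fit <- lm(y ~ x, data = dat)
summary(fit)$coefficients

# Predictive: hold out data and judge the model by out-of-sample error
train <- dat[1:150, ]
test  <- dat[151:200, ]
pred  <- predict(lm(y ~ x, data = train), newdata = test)
rmse  <- sqrt(mean((test$y - pred)^2))
rmse   # accuracy on unseen data is the criterion, not the coefficient
```

The same lm() fit serves both goals; what differs is the question asked of it and how success is measured.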
“To explain or predict” https://www.youtube.com/watch?v=vWH_HNfQVRI