by Kristin Sainani
CONTENTS
1. Descriptive statistics and looking at data
2. Review of study designs; Measures of disease risk and association
3. Probability, Bayes' Rule, Diagnostic Testing
4. Probability distributions
5. Statistical Inference
6. P-values (errors, statistical power, and pitfalls)
7. Statistical Tests
8. Regression Analysis
9. Logistic Regression, Cox Regression
1. Descriptive statistics and looking at data
1.1 Types of Data
1.1.1 Quantitative Variable
Numerical data (e.g., age, blood pressure, BMI, pulse) that you can add, subtract, multiply, and divide.
ㆍContinuous (quantitative) variable: can theoretically take on any value within a given range (e.g., height=68.99955... inches)
ㆍDiscrete (quantitative) variable: can only take on certain values (e.g., count data)
In practice, however, the distinction between continuous and discrete often makes little difference. For example, family size is discrete (2, 3, or more people), yet the average family size can be 2.5, which is not itself a possible value. To summarize (or average) discrete data correctly, we therefore often treat it as if it were continuous.
1.1.2 Categorical Variable
The numbers carry no inherent numerical meaning; they are only labels that distinguish the categories.
ㆍBinary: two categories (e.g., dead/alive, treatment/placebo, disease/no disease, etc.)
ㆍNominal: unordered categories (e.g., blood type (O, A, B, AB), marital status, occupation)
ㆍOrdinal: ordered categories (e.g., breast cancer stage 1, 2, 3, or 4; birth order: 1st, 2nd, 3rd, etc.; age in categories such as 10-20, 20-30, etc.)
Ordinal variables are useful when we want to divide a continuous variable into a few categories, because we often don't want to analyze the full range of the continuous variable (for computational reasons, the aims of the study, etc.).
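As a rough illustration of turning a continuous variable into an ordinal one, the sketch below bins ages into ordered categories. The cut points, labels, and the helper name `age_category` are assumptions made up for this example, not part of the course material.

```python
from bisect import bisect_right

# Hypothetical cut points (in years) separating the ordered categories.
AGE_CUTS = [20, 30, 40]
AGE_LABELS = ["<20", "20-29", "30-39", "40+"]

def age_category(age):
    """Map a continuous age to an ordered category label."""
    # bisect_right counts how many cut points are <= age,
    # which is exactly the index of the matching label.
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]

ages = [12.5, 20.0, 29.9, 35.2, 61.0]
categories = [age_category(a) for a in ages]
print(categories)
```

Using non-overlapping categories avoids ambiguity about where a boundary value such as exactly 20 belongs.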
1.1.3 Time-to-event Variable
It's a hybrid variable that has a continuous part (time) and a binary or discrete part (event: yes/no). Examples are time to death and time to heart attack.
It's only encountered in studies that follow participants over time, such as cohort studies and randomized trials (e.g., how long until people in your study have a heart attack? How long do people in your study survive?).
Always plot your data before analyzing it, because a plot helps answer:
- Are there outliers? They can be very influential on some statistics.
- Are there data points that don't make sense (data entry errors)?
- How are the data distributed? What is the shape of the distribution? Where is the center? How much variation is there?
Bar chart for categorical variables
A bar chart is used for categorical variables to show the frequency or proportion in each category.
Box plot for quantitative variables
1.3 Describing Quantitative Data: Where is the center?
There are two common measures of "central tendency": the mean and the median.
1.3.1 Mean
= the average; the balancing point; the sum of the values divided by the sample size
ㆍProperty: The mean is affected by extreme values:
Even a single extreme value can have a huge impact on the mean.
(Interestingly) A binary variable also has a mean.
With the variable coded 0/1, this mean is just the proportion of people who answered "Yes".
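A small sketch of both properties of the mean follows. The numbers are invented purely for illustration: one list gains a single extreme value, and one binary variable is coded 0 = "No", 1 = "Yes".

```python
from statistics import mean

# Hypothetical values, invented only to illustrate the property.
values = [1, 2, 3, 4, 5]
with_outlier = values + [100]      # add one extreme value

mean_plain = mean(values)          # 15 / 5 = 3
mean_outlier = mean(with_outlier)  # 115 / 6, about 19.17

# Binary variable coded 0 = "No", 1 = "Yes".
answers = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
proportion_yes = mean(answers)     # 7 of 10 answered "Yes" -> 0.7

print(mean_plain, round(mean_outlier, 2), proportion_yes)
```

One added value moves the mean from 3 to about 19, far from where most of the data sit.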
1.3.2 Median
= the exact middle value (50%)
There are two cases:
If there is an odd number of observations, take the middle value directly.
If there is an even number of observations, find the middle two values and average them.
ㆍProperty: The median is NOT affected by extreme values:
Since the extreme value does not affect the median, the median is the same with or without it. The median depends only on how many observations lie on each side of it, not on how far away they are.
(Interestingly) A binary variable also has a median.
In the varsity sports bar chart, if you line up the 0's and 1's, the middle number is 1, so the median is 1.
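The two calculation cases and the robustness of the median can be checked with a short sketch. All values below are made up for illustration.

```python
from statistics import median

odd = [3, 1, 7, 5, 9]    # sorted: 1 3 5 7 9 -> middle value is 5
even = [3, 1, 7, 5]      # sorted: 1 3 5 7   -> (3 + 5) / 2 = 4

median_odd = median(odd)
median_even = median(even)

# Replacing the largest value with an extreme one leaves the median alone.
values = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]
median_plain = median(values)
median_outlier = median(with_outlier)

# Binary variable with more 1's than 0's: the middle value is a 1.
binary = [0, 0, 0, 1, 1, 1, 1, 1, 1]
median_binary = median(binary)

print(median_odd, median_even, median_plain, median_outlier, median_binary)
```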
Mean vs. Median
For skewed data, the median is preferred because the mean can be highly misleading (the mean is affected by extreme values).
Consider a hypothetical example: 10 dieters following diet 1 vs. 10 dieters following diet 2.
Based on the means alone, can we be convinced that diet 1 is better?
Let's look at the histograms.
Because of an extreme value, the means and the medians give very different pictures of the two groups. Therefore, we should compare medians (ranked data) rather than means, using a non-parametric test such as the Wilcoxon rank-sum test (also known as the Mann-Whitney U test).
In general, when we compare distributions, we need to choose an appropriate summary statistic.
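To make the rank-based comparison concrete, here is a minimal sketch that computes the Mann-Whitney U statistic by direct pair counting. The weight changes are invented (they are not the data from the lecture), the function name `mann_whitney_u` is hypothetical, and no p-value is computed; in practice a library routine such as SciPy's `mannwhitneyu` would normally be used.

```python
from statistics import mean, median

def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: number of (a, b) pairs with a > b.

    Ties count as half a pair. Only the ordering of the values matters,
    so one extreme value cannot dominate the result.
    """
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical weight changes in pounds (negative = weight lost).
diet_1 = [-300, 2, 1, 0, -1, -2, -3, -4, -5, -6]   # one extreme dieter
diet_2 = [-10, -12, -14, -16, -18, -20, -22, -24, -26, -28]

mean_1, mean_2 = mean(diet_1), mean(diet_2)         # -31.8 vs -19
median_1, median_2 = median(diet_1), median(diet_2) # -2.5 vs -19

# In 90 of the 100 pairs, the diet 1 dieter lost less weight.
u_diet_1 = mann_whitney_u(diet_1, diet_2)
print(mean_1, mean_2, median_1, median_2, u_diet_1)
```

In this made-up data the means favor diet 1 only because of one person, while the medians and the U statistic both point the other way.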
1.4 Describing Quantitative Data: What is the variability in the data?
Measures of Variability
Range (the spread of the data), standard deviation/variance, percentiles, and interquartile range (IQR)
1.4.1 Range
Difference between the largest and the smallest observations
Motivation of Variance and Standard Deviation
Challenge: devise a statistic that gives the average distance from the mean.
A naive attempt is the average deviation from the mean: $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})$
But this won't work: deviations can be negative as well as positive, so they cancel out and the sum is always zero.
How can we get rid of the negatives? Absolute values? No, they are too messy mathematically.
Instead, we square the deviations, which eliminates the negatives.
1.4.2 Variance
Average squared distance from the mean: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
When we calculate the variance or standard deviation, we have to calculate the mean first. By doing so, we lose a "degree of freedom", so we divide by n-1 rather than n.
For example, when averaging the ages of 100 people, once you've calculated the mean you have only 99 degrees of freedom left: given the mean and the other 99 ages, the 100th person's age is fixed (it can't be any other number). So you've lost a "degree of freedom".
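The reasoning above can be traced step by step in a short sketch. The five ages are hypothetical; the hand calculation is compared with `statistics.variance`, which also divides by n-1.

```python
from statistics import mean, variance

ages = [20, 25, 30, 35, 40]          # hypothetical sample
n = len(ages)
x_bar = mean(ages)                   # 30

# Raw deviations cancel out, so their average is useless as a spread.
deviations = [x - x_bar for x in ages]   # -10, -5, 0, 5, 10
sum_of_deviations = sum(deviations)      # always 0

# Squaring removes the signs; dividing by n - 1 gives the sample variance.
sum_of_squares = sum(d ** 2 for d in deviations)  # 100 + 25 + 0 + 25 + 100 = 250
sample_variance = sum_of_squares / (n - 1)        # 250 / 4 = 62.5

# Lost degree of freedom: the last deviation is fixed by the other n - 1.
last_is_fixed = deviations[-1] == -sum(deviations[:-1])

print(sum_of_deviations, sample_variance, variance(ages), last_is_fixed)
```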
1.4.3 Standard Deviation
Roughly, the average spread around the mean: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
When describing data, it's better to report the standard deviation rather than the variance, because the standard deviation is in the same units as the data.
Because of the squaring, values (e.g., 38) farther from the mean contribute more to the standard deviation than values closer to the mean. That is, the standard deviation is affected by extreme values.
Therefore, the standard deviation is not exactly the average distance from the mean. It represents a sort of average spread or scatter.
(Interestingly) A binary variable also has a standard deviation.
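A brief sketch of the standard deviation, again with invented numbers. For the binary case it uses the population form (`pstdev`, dividing by n), for which the result equals sqrt(p(1-p)); that identity is a standard fact added here for illustration, not something stated in the notes.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

values = [20, 25, 30, 35, 40]        # hypothetical sample
sd = stdev(values)                   # sqrt(62.5), about 7.91, same units as the data

# One extreme value inflates the standard deviation considerably.
sd_with_outlier = stdev([20, 25, 30, 35, 400])

# Binary variable coded 0/1: 7 "Yes" answers out of 10.
binary = [1] * 7 + [0] * 3
p = mean(binary)                     # 0.7
sd_binary = pstdev(binary)           # population SD, divides by n
sd_from_p = sqrt(p * (1 - p))        # sqrt(0.21), about 0.458

print(round(sd, 3), round(sd_with_outlier, 1), round(sd_binary, 3))
```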
Understanding Standard Deviation:
Standard deviations vs. standard errors
- Standard deviation measures the variability of a trait
- Standard error measures the variability of a statistic, which is a theoretical construct
+ Statistics themselves follow distributions, so a statistic has a distribution with a shape and a variability, and that variability is called the standard error.
+ Different statistics have different standard errors.
+ You may have the standard error of a mean, an odds ratio, or a regression coefficient.
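One way to see that a statistic has its own distribution is to simulate many studies and look at how the sample mean varies between them. The sketch below is a toy simulation under assumed values (a normally distributed trait with mean 100 and standard deviation 10, samples of 25); it compares the observed spread of the means with the textbook formula for the standard error of a mean, sd / sqrt(n).

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)       # fixed seed so the sketch is repeatable

POP_SD = 10.0        # assumed standard deviation of the trait
SAMPLE_SIZE = 25     # people per simulated study
N_STUDIES = 2000     # number of simulated studies

sample_means = []
for _ in range(N_STUDIES):
    sample = [random.gauss(100.0, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(mean(sample))

# Variability of the statistic (the mean) across repeated studies.
empirical_se = stdev(sample_means)
# Textbook standard error of a mean: sd / sqrt(n) = 10 / 5 = 2.0
theoretical_se = POP_SD / sqrt(SAMPLE_SIZE)

print(round(empirical_se, 2), theoretical_se)
```

The trait varies with a standard deviation of 10, while the mean of 25 people varies far less from study to study.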
1.4.4 Percentiles
Based on ranking the data
- the 90th percentile is the value for which 90% of observations are lower
- the 50th percentile is the median
- the 10th percentile is the value for which 10% of observations are lower
+ Percentiles are not affected by extreme values (unlike standard deviations)
1.4.5 Interquartile Range (IQR)
- Interquartile range = 3rd quartile - 1st quartile
- the middle 50% of the data
+ interquartile range is not affected by outliers
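The sketch below computes quartiles and the IQR for a small made-up data set, with and without an extreme value. It uses `statistics.quantiles` with its default method; other software may interpolate percentiles slightly differently, so exact cut points can vary between tools.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]            # hypothetical values
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]  # largest value made extreme

def iqr(values):
    """Interquartile range: 3rd quartile minus 1st quartile."""
    q1, _q2, q3 = quantiles(values, n=4)
    return q3 - q1

def value_range(values):
    """Range: largest minus smallest observation."""
    return max(values) - min(values)

q1, q2, q3 = quantiles(data, n=4)   # q2 is the 50th percentile (the median)
deciles = quantiles(data, n=10)     # 9 cut points: 10th, 20th, ..., 90th percentile

print(iqr(data), iqr(with_outlier))                   # unchanged by the outlier
print(value_range(data), value_range(with_outlier))   # 8 vs 899
```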
1.5 Exploring real data: Lead in lipstick
News Headlines
"Lipsticks Contain Excessive Lead, Tests Reveal"
"One third of lipsticks on the market contain high lead"
"400 shades of lipstick found to contain lead"
"What's in Your Lipstick? FDA Finds Lead in 400 Shades"
2007 report by a consumer advocacy group
"One-third of the lipsticks tested contained an amount of lead that exceeded the U.S. Food and Drug Administration's 0.1 ppm limit for lead in candy - a standard established to protect children from ingesting lead."
1 ppm = 1 part per million = 1 microgram/gram
Is this really a serious problem? How worried should women be?
- What is the dose of lead in lipstick?
- How much lipstick are women exposed to?
- How much lipstick do women ingest?
Let's explore the data to check whether these headlines hold up.
FDA 2009 (n=22) vs. FDA 2012 (n=400)
Distribution of lead in lipstick (n=400 samples, FDA 2012)
How much lipstick are women using on a daily basis?
- The histogram looks "right skewed"
- 1 in 30,000 women uses 218 milligrams of lipstick per day. (One tube of lipstick contains 4000 milligrams.)
- The heaviest user goes through an entire tube of lipstick in about 18 days. (4000 mg/tube ÷ 218 mg/day = 18 days per tube)
Assume that women ingest 50% of the lipstick they apply daily.
What is the typical lead exposure to lipstick for women, in micrograms of lead, based on medians?
- Daily exposure: 0.89 mcg/g x 17.11 mg x 1 g/1000 mg = 0.0152 mcg
- Daily ingestion: 0.0152 mcg/2 = 0.0076 mcg
What is the highest daily lead exposure to lipstick for women, in mcg of lead?
- Daily exposure: 7.19 mcg/g x 217.53 mg x 1 g/1000 mg = 1.56 mcg
- Daily ingestion: 1.56 mcg/2 = 0.78 mcg
To put these numbers in perspective:
- The "provisional tolerable daily intake" (PTDI) for an adult is 75 micrograms/day, which is considered a safe amount
- 0.0076 mcg / 75 mcg = 0.01% of your PTDI
- 0.78 mcg / 75 mcg = 1% of your PTDI (1 in 12 million women)
- Interestingly, the average American consumes 1 to 4 mcg of lead per day from food alone.
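The exposure arithmetic above can be reproduced in a few lines. All input numbers come from the notes; the helper name `daily_lead_mcg` and the constant names are made up for this sketch.

```python
MG_PER_G = 1000
PTDI_MCG = 75.0            # provisional tolerable daily intake for an adult
INGESTED_FRACTION = 0.5    # assumption from the notes: half of applied lipstick is ingested

def daily_lead_mcg(lead_mcg_per_g, lipstick_mg_per_day):
    """Micrograms of lead applied per day via lipstick."""
    return lead_mcg_per_g * lipstick_mg_per_day / MG_PER_G

# Typical case: median lead content and median daily lipstick use.
typical_exposure = daily_lead_mcg(0.89, 17.11)            # about 0.0152 mcg
typical_ingested = typical_exposure * INGESTED_FRACTION   # about 0.0076 mcg

# Extreme case: highest lead content and heaviest daily use.
max_exposure = daily_lead_mcg(7.19, 217.53)               # about 1.56 mcg
max_ingested = max_exposure * INGESTED_FRACTION           # about 0.78 mcg

typical_pct_of_ptdi = typical_ingested / PTDI_MCG * 100   # about 0.01 %
max_pct_of_ptdi = max_ingested / PTDI_MCG * 100           # about 1 %

print(round(typical_ingested, 4), round(max_ingested, 2))
print(round(typical_pct_of_ptdi, 2), round(max_pct_of_ptdi, 2))
```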
In the case of chocolate...
Median level of lead in milk chocolate = 0.016 mcg/g (FDA limit = 0.1 mcg/g)
0.016 mcg/g (chocolate median) << 0.89 mcg/g (lipstick median) << 7.19 mcg/g (lipstick maximum)
Exposure from 1 chocolate bar: 0.016 mcg/g x 43 g = 0.69 mcg
Typical daily exposure from chocolate: 0.016 mcg/g x 13.7 g = 0.22 mcg
Conclusion
Typical daily exposure from chocolate (0.22 mcg) is 29 times higher than the typical exposure from lipstick (0.0076 mcg), and the extreme daily exposure to lead from lipstick (0.78 mcg) is similar to the exposure from eating one average chocolate bar a day (0.69 mcg).
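The chocolate comparison follows from the same kind of arithmetic. The inputs below are the figures quoted in the notes; the variable names are invented for the sketch.

```python
CHOCOLATE_LEAD_MCG_PER_G = 0.016   # median lead level in milk chocolate
BAR_GRAMS = 43                     # one chocolate bar
TYPICAL_DAILY_GRAMS = 13.7         # typical daily chocolate consumption

lead_per_bar = CHOCOLATE_LEAD_MCG_PER_G * BAR_GRAMS                  # about 0.69 mcg
typical_chocolate = CHOCOLATE_LEAD_MCG_PER_G * TYPICAL_DAILY_GRAMS   # about 0.22 mcg

TYPICAL_LIPSTICK_INGESTED = 0.0076   # mcg/day, median-based lipstick estimate

# How many times larger is the typical chocolate exposure?
ratio = round(typical_chocolate, 2) / TYPICAL_LIPSTICK_INGESTED      # about 29

print(round(lead_per_bar, 2), round(typical_chocolate, 2), round(ratio))
```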
2. Review of study designs; Measures of disease risk and association
2.1 Quick review of study designs
Why are there various study designs?
First, we need to look at the data from various perspectives.
Second, different designs provide different levels of evidence for the hypotheses we're testing.
- In an observational study, the investigators just observe the participants; they don't intervene in any way.
- In an experimental study, the investigators actually intervene in some way.
- The randomized controlled trial is considered the gold standard of study design, but it's not always possible to run one.
Limitation of observational research: Confounding
Confounding is the reason observational studies provide a lower level of evidence for a hypothesis. Risk factors don't occur in isolation, except in a controlled experiment. For example, alcohol and lung cancer are correlated, but only because drinkers also tend to be smokers; alcohol itself doesn't biologically cause lung cancer. If you're just observing participants in a study, it's very hard to tease out a single risk factor from among the many risk factors that tend to go together. In this case, the apparent association between alcohol and lung cancer is really just due to confounding by smoking.
As another example, in an observational study of whether red meat consumption is correlated with mortality, it is difficult to isolate the effect of eating red meat on health or mortality because of confounding. In other words, risk factors cluster.
Ways to avoid or control for confounding
ㆍDuring the design phase: randomize (as in a randomized trial) or match (as in a matched case-control study)
ㆍIn the analysis phase: use multivariate regression to statistically "adjust for" confounders. However, statistical adjustment is not a panacea: you can't control for all confounders, and there is always "residual" confounding.
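The alcohol and smoking example can be made concrete with a toy table. The counts below are entirely invented so that lung cancer risk depends only on smoking; the sketch then shows a crude association with drinking that disappears once the data are split by smoking status. Stratifying like this is a simpler stand-in for the regression adjustment mentioned above, not the same technique.

```python
# (drinker, smoker) -> (lung cancer cases, people); hypothetical counts.
# Risk is 10% for smokers and 1% for non-smokers, regardless of drinking.
counts = {
    (True, True): (80, 800),
    (True, False): (2, 200),
    (False, True): (20, 200),
    (False, False): (8, 800),
}

def risk(drinker, smoker=None):
    """Risk of lung cancer among drinkers or non-drinkers.

    If smoker is given, restrict to that smoking stratum.
    """
    cases = people = 0
    for (d, s), (c, n) in counts.items():
        if d == drinker and (smoker is None or s == smoker):
            cases += c
            people += n
    return cases / people

# Crude comparison ignores smoking: 8.2% vs 2.8%.
crude_risk_ratio = risk(True) / risk(False)                        # about 2.93

# Within each smoking stratum, drinkers and non-drinkers have equal risk.
ratio_among_smokers = risk(True, True) / risk(False, True)         # 1.0
ratio_among_nonsmokers = risk(True, False) / risk(False, False)    # 1.0

print(round(crude_risk_ratio, 2), ratio_among_smokers, ratio_among_nonsmokers)
```

Drinkers look almost three times as likely to develop lung cancer only because most drinkers in this made-up population are smokers.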
2.1.1 Cross-sectional (prevalence) Study
Measure prevalence of disease and exposure on a random sample of the population of interest at one time point.
+ Cheap and easy (In general, by conducting a survey)
- Correlation does not imply causation
- Cannot determine what came first
- Confounding
Example - Relationship between atherosclerosis and late-life depression
Researchers measured the prevalence of coronary artery calcification (atherosclerosis) and the prevalence of depressive symptoms in a large cohort of elderly men and women in Rotterdam (n=1920, random sample).
They looked at everything at one time point and simply asked whether the two are correlated. Even if artery blockage and depression are correlated, they can't know which came first. (Correlation does not imply causation.)
2.1.2 Case-Control Study
Sample on disease status and ask retrospectively about exposures.
+ Efficient for rare diseases and outbreak situations (in a cross-sectional study, a random sample would contain very few cases of a rare disease)
- Getting appropriate controls is tricky
- Recall bias (due to dependence only on memory)
- Confounding
- The risk factor may have come after the disease (we can't be sure whether the risk factor preceded the disease or vice versa, because everything is measured after the disease has already occurred)
⑴ Form a case group of people who have the disease
⑵ Find a control group of people who don't have the disease
⑶ Ask both groups retrospectively about their risk factors and exposures
⑷ Figure out how the risk factors differ between the cases and the controls
Example - Early case-control studies of AIDS cases and matched controls indicated that AIDS was transmitted by sexual contact or blood products, before it was known that a virus was the cause.
Researchers also found that the cases were much more likely than the controls to have used a particular type of drug. It turned out that the drug wasn't a causative agent in AIDS; this was an example of confounding, because people who used this drug also tended to engage in the other high-risk behaviors associated with developing AIDS.
2.1.3 Prospective Cohort Study
Measure risk factors on people who are disease-free at baseline; then follow them over time and calculate risks or rates of developing disease.
+ Exposures are measured prior to outcomes
+ Can study multiple outcomes (piggy-back)
- Time and money!
- Confounding (because it's still observational)
- Loss to follow-up
A prospective cohort study has some major advantages over cross-sectional and case-control studies, because participants' risk factors and exposures are measured before they develop diseases or outcomes.
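As a rough sketch of what "calculate risks or rates" means in a cohort, the example below uses an invented cohort. The counts and the person-years total are assumptions for illustration; person-years is the usual denominator for a rate because participants are followed for different lengths of time (for example, due to loss to follow-up).

```python
# Hypothetical cohort, invented for illustration only.
n_at_baseline = 1000      # disease-free people enrolled
new_cases = 50            # developed the disease during follow-up
person_years = 9500.0     # total follow-up time actually observed

# Risk: proportion of the baseline cohort that developed the disease.
risk = new_cases / n_at_baseline                      # 0.05, i.e. 5%

# Rate: new cases per unit of follow-up time.
rate_per_1000_py = new_cases / person_years * 1000    # about 5.26 per 1,000 person-years

print(risk, round(rate_per_1000_py, 2))
```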
Example - The Framingham Heart Study
The Framingham Heart Study enrolled 5209 residents of Framingham, MA, aged 28 to 62 years, in 1948. Researchers measured their health and lifestyle factors (blood pressure, weight, exercise, etc.). Then, they followed them for decades to determine the occurrence of heart disease. The study continues today, tracking the kids and grandkids of the original cohort.
Reference
[1] Statistics in Medicine, Stanford University, Kristin Sainani, Stanford Online