DATA606 - Intro to Data

February 6, 2019

Announcements

Next week’s meetup will be on Tuesday!
Lab 0 has been graded. A couple of general notes:
- There are questions you should answer before the “On your own” section. Keeping all the text is fine.
- Use the .Rmd file created with the startLab function.
- If you are using RPubs (which I highly recommend), put the link in the submission text.
- Great start!
Installing LaTeX (it is probably worth the effort to do this sooner rather than later)
- Links are located here: https://data606.net/course-overview/software/
There is a general CUNY MSDS Slack channel that you can join.

Presentations

Jagdish Chhabria (1.55)
Mia Chen
Isabel Ramesar (1.33): https://www.youtube.com/watch?v=GlZBmDxXFcs&feature=youtu.be

Sampling vs. Census

A census involves collecting data for the entire population of interest. This is problematic for several reasons, including:

It can be difficult to complete a census: there always seem to be some individuals who are hard to locate or hard to measure. And these difficult-to-find people may have certain characteristics that distinguish them from the rest of the population.
Populations rarely stand still. Even if you could take a census, the population changes constantly, so it’s never possible to get a perfect measure.
Taking a census may be more complex than sampling.

Sampling involves measuring a subset of the population of interest, usually randomly.

Sampling Bias

Non-response: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population.
Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the population.
Convenience sample: Individuals who are easily accessible are more likely to be included in the sample.
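
The effect of non-response and voluntary response can be illustrated with a small simulation. This is only a sketch: the 1-5 satisfaction scale, the response probabilities, and all variable names are made-up assumptions, not data from the course.

```r
set.seed(606)  # fixed seed so the simulation is reproducible

# Hypothetical population: satisfaction ratings on a 1-5 scale
satisfaction <- sample(1:5, size = 10000, replace = TRUE)

# A random sample of 1,000 people is invited to take the survey
invited <- sample(satisfaction, 1000)

# Assumed response behavior: people with extreme views (1 or 5)
# respond far more often than people in the middle
p_respond <- ifelse(invited %in% c(1, 5), 0.8, 0.1)
responded <- invited[runif(length(invited)) < p_respond]

# Share of extreme views among everyone invited vs. among respondents
share_invited   <- mean(invited %in% c(1, 5))
share_responded <- mean(responded %in% c(1, 5))
c(invited = share_invited, responded = share_responded)
```

Even though the invitation list was drawn at random, the people who actually respond over-represent strong opinions, so the responses no longer describe the population.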

Observational Studies vs. Experiments

Observational study: Researchers collect data in a way that does not directly interfere with how the data arise, i.e. they merely “observe”, and can only establish an association between the explanatory and response variables.
Experiment: Researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables.

Source: XKCD 552 http://xkcd.com/552/

Correlation does not imply causation!
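
A small simulation shows how a confounder can produce an association in observational data without any causal link. The variables (temperature, ice cream sales, sunburn cases) and all coefficients are hypothetical and chosen only for illustration.

```r
set.seed(606)
n <- 1000

# Hypothetical confounder: daily temperature
temperature <- rnorm(n, mean = 20, sd = 5)

# Both variables depend on temperature, but not on each other
ice_cream_sales <- 10 + 2 * temperature + rnorm(n, sd = 3)
sunburn_cases   <- 1 + 0.5 * temperature + rnorm(n, sd = 1)

# Strong association between the two observed variables
r <- cor(ice_cream_sales, sunburn_cases)
r

# Adjusting for the confounder: the ice cream coefficient is close to zero
fit <- lm(sunburn_cases ~ ice_cream_sales + temperature)
coef(fit)
```

In an experiment, random assignment plays the role that the adjustment plays here: it breaks the link between the treatment and any confounder, measured or not.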

Simple Random Sampling

Randomly select cases from the population, where there is no implied connection between the points that are selected.
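
A minimal sketch of a simple random sample in base R. The population data frame and its `value` column are hypothetical; `sample()` without replacement gives every row the same chance of being selected.

```r
set.seed(606)  # fixed seed so the draw is reproducible

# Hypothetical population: 1,000 units with a made-up measurement
population <- data.frame(id = 1:1000,
                         value = rnorm(1000, mean = 50, sd = 10))

# Simple random sample of 100 rows, drawn without replacement
srs_rows <- sample(nrow(population), size = 100)
srs <- population[srs_rows, ]

# The sample mean is an estimate of the population mean
mean(srs$value)
mean(population$value)
```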

Simple Random Sample

Stratified Sampling

Strata are made up of similar observations. We take a simple random sample from each stratum.
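
One way to sketch stratified sampling in base R, assuming a hypothetical population with a `stratum` column: split the row indices by stratum, then draw a simple random sample from each group.

```r
set.seed(606)

# Hypothetical population with three strata of unequal size
population <- data.frame(
  id = 1:600,
  stratum = rep(c("A", "B", "C"), times = c(300, 200, 100))
)

# Split row indices by stratum, then take a simple random sample of 20 from each
rows_by_stratum <- split(seq_len(nrow(population)), population$stratum)
sampled_rows <- unlist(lapply(rows_by_stratum, function(rows) sample(rows, 20)))
stratified <- population[sampled_rows, ]

# Every stratum contributes the same number of observations
table(stratified$stratum)
```

Sampling a fixed number per stratum is only one choice; sampling proportionally to stratum size is another common design.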

Cluster Sampling

Clusters are usually not made up of homogeneous observations, so we first take a random sample of clusters and then take a random sample of observations within each selected cluster.
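
The two stages can be sketched in base R as follows. The population, the number of clusters, and the sample sizes are hypothetical values chosen for illustration.

```r
set.seed(606)

# Hypothetical population: 20 clusters (e.g. schools) of 50 units each
population <- data.frame(
  id = 1:1000,
  cluster = rep(1:20, each = 50)
)

# Stage 1: randomly select 5 of the 20 clusters
chosen_clusters <- sample(unique(population$cluster), 5)

# Stage 2: simple random sample of 10 units within each chosen cluster
rows_by_cluster <- split(seq_len(nrow(population)), population$cluster)
sampled_rows <- unlist(lapply(rows_by_cluster[as.character(chosen_clusters)],
                              function(rows) sample(rows, 10)))
cluster_sample <- population[sampled_rows, ]

# Only the chosen clusters appear, with 10 units each
table(cluster_sample$cluster)
```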

Principles of experimental design

Control: Compare treatment of interest to a control group.
Randomize: Randomly assign subjects to treatments, and randomly sample from the population whenever possible.
Replicate: Within a study, replicate by collecting a sufficiently large sample. Or replicate the entire study.
Block: If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups.
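
Blocking followed by randomization can be sketched in base R. The subjects and the `risk` blocking variable below are hypothetical; the point is that treatment labels are shuffled separately within each block.

```r
set.seed(606)

# Hypothetical subjects; risk level is the blocking variable
subjects <- data.frame(
  id = 1:40,
  risk = rep(c("low", "high"), each = 20)
)

# Within each block, shuffle an equal number of treatment and control labels
subjects$group <- NA_character_
for (block in unique(subjects$risk)) {
  rows <- which(subjects$risk == block)
  labels <- rep(c("treatment", "control"), length.out = length(rows))
  subjects$group[rows] <- sample(labels)
}

# Each block is split evenly between the two groups
table(subjects$risk, subjects$group)
```

Because assignment is balanced within each block, a difference in risk level cannot be confused with a treatment effect.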

Difference between blocking and explanatory variables

Factors are conditions we can impose on the experimental units.
Blocking variables are characteristics that the experimental units come with, that we would like to control for.
Blocking is like stratifying, except used in experimental settings when randomly assigning, as opposed to when sampling.

More experimental design terminology…

Placebo: fake treatment, often used as the control group for medical studies
Placebo effect: experimental units showing improvement simply because they believe they are receiving a special treatment
Blinding: when experimental units do not know whether they are in the control or treatment group
Double-blind: when both the experimental units and the researchers who interact with the patients do not know who is in the control and who is in the treatment group

Random assignment vs. random sampling

Causality

Randomized Control Trials

`ggplot2`

ggplot2 is an R package that provides an alternative framework based upon Wilkinson’s (2005) Grammar of Graphics.
ggplot2 is, in general, more flexible for creating “prettier” and complex plots.
Works by creating layers of different types of objects/geometries (e.g. bars, points, lines, polygons).
ggplot2 has at least three ways of creating plots:
1. qplot
2. ggplot(...) + geom_XXX(...) + ...
3. ggplot(...) + layer(...)
We will focus only on the second.

First Example

library(ggplot2)
data(diamonds)
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()

Parts of a `ggplot2` Statement

Data
ggplot(myDataFrame, aes(x=x, y=y))
Layers
geom_point(), geom_histogram()
Facets
facet_wrap(~ cut), facet_grid(~ cut)
Scales
scale_y_log10()
Other options
ggtitle('my title'), ylim(c(0, 10000)), xlab('x-axis label')
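
Putting the parts together, a single statement might look like the sketch below. It uses the `diamonds` data from the first example; the `alpha` value, title, and axis labels are illustrative choices.

```r
library(ggplot2)
data(diamonds)

p <- ggplot(diamonds, aes(x = carat, y = price)) +  # data and aesthetic mappings
  geom_point(alpha = 0.2) +                          # layer
  facet_wrap(~ cut) +                                # facets
  scale_y_log10() +                                  # scale
  ggtitle('Diamond price by carat') +                # other options
  xlab('Carat') + ylab('Price (log scale)')
p
```

Each part is added with `+`, so individual pieces can be removed or swapped without touching the rest of the statement.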

Lots of geoms

grep('geom_', ls('package:ggplot2'), value = TRUE)

##  [1] "geom_abline"          "geom_area"            "geom_bar"            
##  [4] "geom_bin2d"           "geom_blank"           "geom_boxplot"        
##  [7] "geom_col"             "geom_contour"         "geom_count"          
## [10] "geom_crossbar"        "geom_curve"           "geom_density"        
## [13] "geom_density_2d"      "geom_density2d"       "geom_dotplot"        
## [16] "geom_errorbar"        "geom_errorbarh"       "geom_freqpoly"       
## [19] "geom_hex"             "geom_histogram"       "geom_hline"          
## [22] "geom_jitter"          "geom_label"           "geom_line"           
## [25] "geom_linerange"       "geom_map"             "geom_path"           
## [28] "geom_point"           "geom_pointrange"      "geom_polygon"        
## [31] "geom_qq"              "geom_qq_line"         "geom_quantile"       
## [34] "geom_raster"          "geom_rect"            "geom_ribbon"         
## [37] "geom_rug"             "geom_segment"         "geom_sf"             
## [40] "geom_sf_label"        "geom_sf_text"         "geom_smooth"         
## [43] "geom_spoke"           "geom_step"            "geom_text"           
## [46] "geom_tile"            "geom_violin"          "geom_vline"          
## [49] "update_geom_defaults"

Scatterplot Revisited

ggplot(legosets, aes(x=Pieces, y=USD_MSRP)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, color=Availability)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures, color=Availability)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures)) + geom_point() + facet_wrap(~ Availability)

Boxplots Revisited

ggplot(legosets, aes(x='Lego', y=USD_MSRP)) + geom_boxplot()

Boxplots Revisited (cont.)

ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot()

Boxplots Revisited (cont.)

ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot() + coord_flip()

Likert Scales

Likert scales are a type of questionnaire where respondents are asked to rate items on scales usually ranging from four to seven levels (e.g. strongly disagree to strongly agree).

library(likert)
library(reshape)
data(pisaitems)
items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q']
items24 <- rename(items24, c(
            ST24Q01="I read only if I have to.",
            ST24Q02="Reading is one of my favorite hobbies.",
            ST24Q03="I like talking about books with other people.",
            ST24Q04="I find it hard to finish books.",
            ST24Q05="I feel happy if I receive a book as a present.",
            ST24Q06="For me, reading is a waste of time.",
            ST24Q07="I enjoy going to a bookstore or a library.",
            ST24Q08="I read only to get information that I need.",
            ST24Q09="I cannot sit still and read for more than a few minutes.",
            ST24Q10="I like to express my opinions about books I have read.",
            ST24Q11="I like to exchange books with my friends."))

`likert` R Package

l24 <- likert(items24)
summary(l24)

##                                                        Item      low
## 10   I like to express my opinions about books I have read. 41.07516
## 5            I feel happy if I receive a book as a present. 46.93475
## 8               I read only to get information that I need. 50.39874
## 7                I enjoy going to a bookstore or a library. 51.21231
## 3             I like talking about books with other people. 54.99129
## 11                I like to exchange books with my friends. 55.54115
## 2                    Reading is one of my favorite hobbies. 56.64470
## 1                                 I read only if I have to. 58.72868
## 4                           I find it hard to finish books. 65.35125
## 9  I cannot sit still and read for more than a few minutes. 76.24524
## 6                       For me, reading is a waste of time. 82.88729
##    neutral     high     mean        sd
## 10       0 58.92484 2.604913 0.9009968
## 5        0 53.06525 2.466751 0.9446590
## 8        0 49.60126 2.484616 0.9089688
## 7        0 48.78769 2.428508 0.9164136
## 3        0 45.00871 2.328049 0.9090326
## 11       0 44.45885 2.343193 0.9609234
## 2        0 43.35530 2.344530 0.9277495
## 1        0 41.27132 2.291811 0.9369023
## 4        0 34.64875 2.178299 0.8991628
## 9        0 23.75476 1.974736 0.8793028
## 6        0 17.11271 1.810093 0.8611554

`likert` Plots

plot(l24)

`likert` Plots

plot(l24, type='heat')

`likert` Plots

plot(l24, type='density')

Dual Scales

Some problems¹:

The designer has to make choices about scales and this can have a big impact on the viewer
“Cross-over points” where one series crosses another are results of the design choices, not intrinsic to the data, and viewers (particularly unsophisticated viewers) may nonetheless read meaning into them
They make it easier to lazily associate correlation with causation, not taking into account autocorrelation and other time-series issues
Because of the issues above, in malicious hands they make it possible to deliberately mislead

library(DATA606)
shiny_demo('DualScales', package='DATA606')

My advice:

Avoid using them. You can usually do better with other plot types.
When you must use them (by necessity or because you are compelled to), rescale the series using z-scores.
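
The rescaling advice can be sketched in base R. The two series below are simulated and purely hypothetical; `scale()` converts each to z-scores so both can share one y-axis instead of two.

```r
set.seed(606)

# Two hypothetical monthly series on very different scales
months <- 1:24
series_a <- 1000 + 25 * months + rnorm(24, sd = 40)   # e.g. units sold
series_b <- 3 + 0.05 * months + rnorm(24, sd = 0.2)   # e.g. a rate in percent

# scale() centers each series at 0 and divides by its standard deviation
z_a <- as.numeric(scale(series_a))
z_b <- as.numeric(scale(series_b))

# Both series now share a single y-axis measured in standard deviations
plot(months, z_a, type = 'l', ylim = range(z_a, z_b),
     xlab = 'Month', ylab = 'z-score')
lines(months, z_b, lty = 2)
legend('topleft', legend = c('Series A', 'Series B'), lty = c(1, 2))
```

After rescaling, the vertical position of each line reflects how far a value sits from that series' own mean, so where the lines cross no longer depends on an arbitrary choice of axis limits.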

¹ http://blog.revolutionanalytics.com/2016/08/dual-axis-time-series.html
² http://ellisp.github.io/blog/2016/08/18/dualaxes
