# Chapter 1 - Introduction to Data

## Learning Objectives

- Identify the type of variables (e.g. numerical or categorical; discrete or continuous; ordered or not ordered).
- Identify the relationship between multiple variables (i.e. independent vs. dependent).
- Define variables that are not associated as independent.
- Be able to describe and identify the difference between observational and experimental studies.
- Distinguish between simple random, stratified, and cluster sampling, and recognize the benefits and drawbacks of choosing one sampling scheme over another.
- Identify the four principles of experimental design and recognize their purposes: control any possible confounders, randomize into treatment and control groups, replicate by using a sufficiently large sample or repeating the experiment, and block any variables that might influence the response.
- Use appropriate visualizations for different types of data (e.g. histogram, barplot, scatterplot, boxplot, etc.).
- Use different measures of center and spread and be able to describe the robustness of different statistics.
- Describe the shape of distributions using histograms and boxplots.
- Create and interpret contingency and frequency tables (one- and two-way tables).
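
As a small illustration of the robustness objective above, the sketch below compares measures of center and spread on a made-up vector before and after adding one extreme value. The data and the helper name `summarize_vec` are hypothetical and used only for illustration.

```r
# Hypothetical data: a small sample, and the same sample plus one extreme value
x <- c(2, 3, 3, 4, 5, 5, 6)
x_out <- c(x, 60)

# Helper (illustrative name) returning two measures of center and two of spread
summarize_vec <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), IQR = IQR(v))
}

# Compare the statistics side by side
round(rbind(original = summarize_vec(x), with_outlier = summarize_vec(x_out)), 2)
```

The mean and standard deviation shift substantially when the extreme value is added, while the median and IQR change much less, which is what is meant by calling them robust statistics.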

## Assignments

### Homework

*OpenIntro Statistics*

- Practice: 1.7 (available in R using the `data(iris)` command), 1.9, 1.23, 1.33, 1.55, 1.69
- Graded: 1.8, 1.10, 1.28, 1.36, 1.48, 1.50, 1.56, 1.70
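
For 1.7, a minimal sketch of loading and inspecting the built-in `iris` data in base R might look like the following. The choice of `str()` and `head()` here is just one reasonable way to look at the data, not a required approach.

```r
# Load the built-in iris data set into the workspace
data(iris)

# Show the structure: one line per variable with its type
str(iris)

# Show the first few rows
head(iris)
```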

For 1.48, the following R code will create a vector `scores` that can be used to answer the question:

```
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)
```
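
As a possible starting point, the sketch below computes a five-number summary of `scores` with base R's `fivenum()`. Note that `fivenum()` uses Tukey's hinges, so its quartiles can differ slightly from those reported by `summary()` or `quantile()`, which use a different default method.

```r
scores <- c(57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83, 88, 89, 94)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(scores)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
five
```

The base R function `boxplot(scores)` draws a box plot from the same vector.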

### Lab 1

The labs are available in the `DATA606` R package. To start the first lab, use the `startLab` function. This will copy the lab to your current working directory and rename the file according to your computer username (as returned by `Sys.info()['user']`). If this is incorrect, either provide the `file-prefix` parameter to `startLab` or rename the file after it has been copied.

```
startLab('Lab1')
```
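
To see in advance which username and directory would be used, a quick check along these lines may help. It uses only base R and does not call anything from the `DATA606` package.

```r
# Look up the username reported by the operating system
user <- Sys.info()[["user"]]
user

# Show the current working directory, where the lab file would be copied
getwd()
```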

## Readings

- OpenIntro Statistics slides
- ggplot2 - `ggplot2` is an R package by Wickham that implements the grammar of graphics (Wilkinson, 2005) in R. I will make frequent use of this graphing framework throughout the course, and it is worth learning.
- Visualizing Likert Data - An R package for visualizing Likert scale data built on the `ggplot2` framework.
- Quick-R base graphics - Covers many of the visualizations using R’s base graphics.

## Videos

*OpenIntro* provides a number of videos. You may find these helpful while reading the chapter.

- Case Study: Using Stents to Prevent Strokes
- Data Basics: Observations, Variables, and Data Matrices
- Observational Studies and Sampling Strategies
- Summarizing and Graphing Numerical Data
- Using Randomization to Analyze a Gender Discrimination Study

## Note about Pie Charts

There is only one pie chart in *OpenIntro Statistics* (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably a difficult question to answer.

However, consider the bar plot below. Here, we clearly see that there is a difference between the ratios of the colors. As John Tukey famously said:

> There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart.

Source: https://en.wikipedia.org/wiki/Pie_chart.
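
A comparison of this kind can be sketched in base R as follows. The values in `prefs` are hypothetical percentages chosen for illustration; they are not the data behind the charts discussed above.

```r
# Hypothetical preference percentages for five colors in three groups
# (illustrative values only; each row sums to 100)
prefs <- rbind(
  A = c(17, 18, 20, 22, 23),
  B = c(20, 20, 19, 21, 20),
  C = c(24, 22, 20, 18, 16)
)
colnames(prefs) <- c("Red", "Blue", "Green", "Yellow", "Purple")

# Top row: pie charts; bottom row: bar plots of the same data
op <- par(mfrow = c(2, 3))
for (g in rownames(prefs)) {
  pie(prefs[g, ], main = paste("Group", g))
}
for (g in rownames(prefs)) {
  barplot(prefs[g, ], main = paste("Group", g), ylim = c(0, 25), las = 2)
}
par(op)  # restore the previous plotting layout
```

With slices this close in size, the pie charts look nearly identical, while the bar heights make the different orderings easy to see.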