January 30, 2019

Announcements

Intro to Data

We will use the lego R package in this class, which contains information about every Lego set manufactured from 1970 to 2015, a total of 6172 sets.

devtools::install_github("seankross/lego")
library(lego)
data(legosets)

Types of Variables

  • Numerical (quantitative)
    • Continuous
    • Discrete
  • Categorical (qualitative)
    • Regular categorical
    • Ordinal
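
As a rough illustration of how these four variable types are commonly represented in R, the sketch below builds a tiny data frame. The column names and values are made up for this example and are not taken from the lego package.

```r
# Hypothetical toy data; names and values are illustrative only.
sets <- data.frame(
  price  = c(9.99, 59.99, 159.99),                 # numerical, continuous
  pieces = c(13L, 898L, 2262L),                    # numerical, discrete (counts)
  theme  = factor(c("City", "Creator", "City")),   # regular categorical
  size   = factor(c("small", "medium", "large"),
                  levels = c("small", "medium", "large"),
                  ordered = TRUE)                  # ordinal: levels have an order
)

# str() reports the storage type of each column.
str(sets)

# Ordered factors support comparisons; unordered factors do not.
sets$size < "large"
```

Note that a character column (such as Theme in legosets) is still a categorical variable even though R does not store it as a factor.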

Data Types in R

Types of Variables

str(legosets)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6172 obs. of  14 variables:
##  $ Item_Number : chr  "10246" "10247" "10248" "10249" ...
##  $ Name        : chr  "Detective's Office" "Ferris Wheel" "Ferrari F40" "Toy Shop" ...
##  $ Year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Theme       : chr  "Advanced Models" "Advanced Models" "Advanced Models" "Advanced Models" ...
##  $ Subtheme    : chr  "Modular Buildings" "Fairground" "Vehicles" "Winter Village" ...
##  $ Pieces      : int  2262 2464 1158 898 13 39 32 105 13 11 ...
##  $ Minifigures : int  6 10 NA NA 1 2 2 3 2 2 ...
##  $ Image_URL   : chr  "http://images.brickset.com/sets/images/10246-1.jpg" "http://images.brickset.com/sets/images/10247-1.jpg" "http://images.brickset.com/sets/images/10248-1.jpg" "http://images.brickset.com/sets/images/10249-1.jpg" ...
##  $ GBP_MSRP    : num  132.99 149.99 69.99 59.99 9.99 ...
##  $ USD_MSRP    : num  159.99 199.99 99.99 79.99 9.99 ...
##  $ CAD_MSRP    : num  200 230 120 NA 13 ...
##  $ EUR_MSRP    : num  149.99 179.99 89.99 69.99 9.99 ...
##  $ Packaging   : chr  "Box" "Box" "Box" "Box" ...
##  $ Availability: chr  "Retail - limited" "Retail - limited" "LEGO exclusive" "LEGO exclusive" ...

Qualitative Variables

Descriptive statistics:

  • Contingency Tables
  • Proportional Tables

Plot types:

  • Bar plot

Contingency Tables

table(legosets$Availability, useNA='ifany')
## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##                   695                     2                  1795 
##           Promotional Promotional (Airline)                Retail 
##                   141                    12                  3120 
##      Retail - limited               Unknown 
##                   403                     4
table(legosets$Availability, legosets$Packaging, useNA='ifany')
##                        
##                         Blister pack  Box Box with backing card Bucket
##   LEGO exclusive                  45  147                     0      1
##   LEGOLAND exclusive               0    2                     0      0
##   Not specified                    0   20                     0      0
##   Promotional                      0   44                     0      0
##   Promotional (Airline)            0   11                     0      0
##   Retail                          53 2575                    16     30
##   Retail - limited                 2  302                     1      5
##   Unknown                          0    1                     0      0
##                        
##                         Canister Foil pack Loose Parts Not specified Other
##   LEGO exclusive               0         0          71             7     5
##   LEGOLAND exclusive           0         0           0             0     0
##   Not specified                0         5           0          1739     0
##   Promotional                  0         0           1             0     3
##   Promotional (Airline)        0         0           0             1     0
##   Retail                      78       285           0             0    28
##   Retail - limited             0         1           0             0     0
##   Unknown                      0         0           0             0     0
##                        
##                         Plastic box Polybag Shrink-wrapped  Tag  Tub
##   LEGO exclusive                  1     412              0    6    0
##   LEGOLAND exclusive              0       0              0    0    0
##   Not specified                   6      24              0    0    1
##   Promotional                     2      90              0    0    1
##   Promotional (Airline)           0       0              0    0    0
##   Retail                          0       4             18    0   33
##   Retail - limited                1      86              0    0    5
##   Unknown                         0       3              0    0    0
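
The useNA argument matters whenever a variable has missing values. A minimal sketch on a made-up vector (not the legosets data) shows the difference:

```r
# Hypothetical availability values with one missing entry.
availability <- c("Retail", "Retail", NA, "Promotional", "Retail")

# By default, table() silently drops missing values.
default_counts <- table(availability)
default_counts

# With useNA = "ifany", missing values get their own category.
counts <- table(availability, useNA = "ifany")
counts
```

With the default, the counts add up to fewer observations than the vector contains, which is easy to overlook.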

Proportional Tables

prop.table(table(legosets$Availability))
## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##          0.1126053143          0.0003240441          0.2908295528 
##           Promotional Promotional (Airline)                Retail 
##          0.0228451069          0.0019442644          0.5055087492 
##      Retail - limited               Unknown 
##          0.0652948801          0.0006480881
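
Each proportion is simply a count divided by the total. The sketch below re-enters the Availability counts from the contingency table by hand, so it runs without the lego package:

```r
# Counts copied from the one-way table of Availability shown earlier.
counts <- c("LEGO exclusive" = 695, "LEGOLAND exclusive" = 2,
            "Not specified" = 1795, "Promotional" = 141,
            "Promotional (Airline)" = 12, "Retail" = 3120,
            "Retail - limited" = 403, "Unknown" = 4)

# A proportion is a count divided by the total number of observations.
props <- counts / sum(counts)

# Percentages are often easier to read than raw proportions.
round(100 * props, 1)
```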

Bar Plots

barplot(table(legosets$Availability), las=3)

Bar Plots

barplot(prop.table(table(legosets$Availability)), las=3)

Quantitative Variables

Descriptive statistics:

  • Mean
  • Median
  • Quartiles
  • Variance: \(s^2 = \frac{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}{n-1}\)
  • Standard deviation: \(s=\sqrt{s^2}\)

Plot types:

  • Dot plots
  • Histograms
  • Density plots
  • Box plots
  • Scatterplots

Measures of Center

mean(legosets$Pieces, na.rm=TRUE)
## [1] 215.1686
median(legosets$Pieces, na.rm=TRUE)
## [1] 82

Measures of Spread

var(legosets$Pieces, na.rm=TRUE)
## [1] 126876.8
sqrt(var(legosets$Pieces, na.rm=TRUE))
## [1] 356.1976
sd(legosets$Pieces, na.rm=TRUE)
## [1] 356.1976
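
To connect var() and sd() with the variance formula in the list of descriptive statistics, here is a small worked sketch on a made-up vector:

```r
# Hypothetical data, chosen so the arithmetic is easy to follow.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
s2_manual <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance.
s_manual <- sqrt(s2_manual)

# Both should agree with the built-in functions.
c(manual = s2_manual, builtin = var(x))
c(manual = s_manual, builtin = sd(x))
```

Here the mean is 5 and the squared deviations sum to 32, so the variance is 32 / 7.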


fivenum(legosets$Pieces, na.rm=TRUE)
## [1]    0.0   30.0   82.0  256.5 5922.0
IQR(legosets$Pieces, na.rm=TRUE)
## [1] 226.25

The summary Function

summary(legosets$Pieces)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    30.0    82.0   215.2   256.2  5922.0     112
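
Note that fivenum() reported 256.5 for the upper quartile while summary() shows 256.2 (and the IQR of 226.25 implies 256.25). This is consistent with fivenum() returning Tukey's hinges while summary(), quantile(), and IQR() use a different default interpolation rule. A small sketch on made-up values shows the two conventions disagreeing:

```r
# Hypothetical data with an even number of observations.
x <- c(10, 20, 30, 40, 50, 60)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max.
hinges <- fivenum(x)
hinges

# Default quantile() interpolation gives slightly different quartiles.
quartiles <- quantile(x, c(0.25, 0.75))
quartiles

# IQR() is based on quantile(), not on the hinges.
IQR(x)
```

Both conventions are reasonable; the difference is usually small and shrinks as the sample grows.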

The psych Package

library(psych)
describe(legosets$Pieces, skew=FALSE)
##    vars    n   mean    sd min  max range   se
## X1    1 6060 215.17 356.2   0 5922  5922 4.58
describeBy(legosets$Pieces, group = legosets$Availability, skew=FALSE, mat=TRUE)
##     item                group1 vars    n      mean        sd min  max
## X11    1        LEGO exclusive    1  659 172.74203 442.96954   1 3428
## X12    2    LEGOLAND exclusive    1    2 211.00000 154.14928 102  320
## X13    3         Not specified    1 1747 145.87178 309.19929   1 5195
## X14    4           Promotional    1  140  53.97143 108.42721   1 1000
## X15    5 Promotional (Airline)    1   12 126.16667  47.01612  10  203
## X16    6                Retail    1 3094 245.78119 294.78052   0 3803
## X17    7      Retail - limited    1  402 410.94030 652.06435   1 5922
## X18    8               Unknown    1    4  27.50000  15.96872   6   44
##     range         se
## X11  3427  17.255643
## X12   218 109.000000
## X13  5194   7.397620
## X14   999   9.163772
## X15   193  13.572384
## X16  3803   5.299546
## X17  5921  32.522014
## X18    38   7.984360
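
If the psych package is not available, similar per-group summaries can be approximated in base R. The sketch below uses made-up data and computes the se column as sd / sqrt(n), which matches the values in the output above (for example, 154.14928 / sqrt(2) = 109).

```r
# Hypothetical data: piece counts and availability for five sets.
pieces <- c(100, 200, 150, 50, 30)
availability <- c("Retail", "Retail", "Retail", "Promotional", "Promotional")

# Split the values by group, then summarize each group.
by_group <- split(pieces, availability)
stats <- t(sapply(by_group, function(x) {
  c(n = length(x),
    mean = mean(x),
    sd = sd(x),
    se = sd(x) / sqrt(length(x)))  # standard error of the mean
}))
stats
```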

Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

  • for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
  • for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread
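
A quick way to see this robustness is to add a single extreme value to a small data set and compare the statistics before and after. The numbers below are made up for illustration:

```r
# Hypothetical piece counts, then the same data plus one very large set.
x <- c(10, 12, 13, 15, 20)
x_outlier <- c(x, 500)

# The mean and SD change dramatically; the median and IQR barely move.
rbind(
  original     = c(mean = mean(x), median = median(x),
                   sd = sd(x), IQR = IQR(x)),
  with_outlier = c(mean = mean(x_outlier), median = median(x_outlier),
                   sd = sd(x_outlier), IQR = IQR(x_outlier))
)
```

One extreme value moves the mean from 14 to 95, while the median only moves from 13 to 14.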

Dot Plot

stripchart(legosets$Pieces)

Dot Plot

par.orig <- par(mar=c(1,10,1,1))
stripchart(legosets$Pieces ~ legosets$Availability, las=1)

par(par.orig)

Histograms

hist(legosets$Pieces)

Transformations

With highly skewed distributions, it is often helpful to transform the data. The log transformation is a common approach, especially when dealing with salary or similar data.

hist(log(legosets$Pieces))

Density Plots

plot(density(legosets$Pieces, na.rm=TRUE), main='Lego Pieces per Set')

Density Plot (log transformed)

plot(density(log(legosets$Pieces), na.rm=TRUE), main='Lego Pieces per Set (log transformed)')

Box Plots

boxplot(legosets$Pieces)

boxplot(log(legosets$Pieces))
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn
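
The warning above arises because at least one set has 0 pieces, and log(0) is -Inf. One possible way to avoid it, sketched here on a made-up vector, is to drop missing and non-positive values before taking the log:

```r
# Hypothetical piece counts, including a zero and a missing value.
pieces <- c(0, 13, 82, 256, 5922, NA)

# log(0) is -Inf, which cannot be drawn on a plot.
log(pieces)

# Keep only observed, strictly positive values before transforming.
positive <- pieces[!is.na(pieces) & pieces > 0]
log_pieces <- log(positive)
log_pieces
```

The cleaned vector can then be passed to boxplot(), hist(), or density(). An alternative is log1p(x), which computes log(1 + x) and maps 0 to 0, at the cost of a slightly different scale.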

Scatter Plots

plot(legosets$Pieces, legosets$USD_MSRP)

Examining Possible Outliers (expensive sets)

legosets[which(legosets$USD_MSRP >= 400),]
##      Item_Number                                   Name Year        Theme
## 901      2000430             Identity and Landscape Kit 2013 Serious Play
## 902      2000431                        Connections Kit 2013 Serious Play
## 2050     2000409                 Window Exploration Bag 2010 Serious Play
## 2852       10179 Ultimate Collector's Millennium Falcon 2007    Star Wars
##                       Subtheme Pieces Minifigures
## 901                                NA           6
## 902                              2455          NA
## 2050                             4900          NA
## 2852 Ultimate Collector Series   5195           5
##                                                 Image_URL GBP_MSRP
## 901  http://images.brickset.com/sets/images/2000430-1.jpg   509.99
## 902  http://images.brickset.com/sets/images/2000431-1.jpg   490.18
## 2050 http://images.brickset.com/sets/images/2000409-1.jpg   314.99
## 2852   http://images.brickset.com/sets/images/10179-1.jpg   342.49
##      USD_MSRP CAD_MSRP EUR_MSRP     Packaging  Availability
## 901    789.99   789.99   699.99 Not specified Not specified
## 902    754.99   754.99   559.99 Not specified Not specified
## 2050   484.99   484.99   359.99 Not specified Not specified
## 2852   499.99       NA       NA Not specified Not specified

Examining Possible Outliers (big sets)

legosets[which(legosets$Pieces >= 4000),]
##      Item_Number                                   Name Year
## 2047       10214                           Tower Bridge 2010
## 2050     2000409                 Window Exploration Bag 2010
## 2628       10189                              Taj Mahal 2008
## 2852       10179 Ultimate Collector's Millennium Falcon 2007
##                Theme                  Subtheme Pieces Minifigures
## 2047 Advanced Models                 Buildings   4287          NA
## 2050    Serious Play                             4900          NA
## 2628 Advanced Models                 Buildings   5922          NA
## 2852       Star Wars Ultimate Collector Series   5195           5
##                                                 Image_URL GBP_MSRP
## 2047   http://images.brickset.com/sets/images/10214-1.jpg   209.99
## 2050 http://images.brickset.com/sets/images/2000409-1.jpg   314.99
## 2628   http://images.brickset.com/sets/images/10189-1.jpg   199.99
## 2852   http://images.brickset.com/sets/images/10179-1.jpg   342.49
##      USD_MSRP CAD_MSRP EUR_MSRP     Packaging     Availability
## 2047   239.99   299.99   219.99           Box Retail - limited
## 2050   484.99   484.99   359.99 Not specified    Not specified
## 2628   299.99   399.99       NA           Box Retail - limited
## 2852   499.99       NA       NA Not specified    Not specified

plot(legosets$Pieces, legosets$USD_MSRP)
bigAndExpensive <- legosets[which(legosets$Pieces >= 4000 | legosets$USD_MSRP >= 400),]
text(bigAndExpensive$Pieces, bigAndExpensive$USD_MSRP, labels=bigAndExpensive$Name)

Pie Charts

There is only one pie chart in OpenIntro Statistics (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

Source: https://en.wikipedia.org/wiki/Pie_chart.

Just say NO to pie charts!

“There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart”

John Tukey
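
As one possible alternative, the same kind of data can be drawn as a grouped bar plot, where small differences in bar height are easier to compare than small differences in angle. The shares below are made-up values for illustration, not the numbers behind the pie charts from the source.

```r
# Hypothetical preference shares (percent) for five colors in three surveys.
shares <- matrix(
  c(17L, 18L, 20L, 22L, 23L,
    20L, 20L, 19L, 21L, 20L,
    20L, 22L, 20L, 18L, 20L),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Survey A", "Survey B", "Survey C"),
                  c("Red", "Blue", "Green", "Yellow", "Purple"))
)

# One group of bars per color, one bar per survey within each group.
barplot(shares, beside = TRUE, legend.text = rownames(shares),
        ylab = "Percent")
```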