Deal with missing data

Identify missing data with is.na()

Find out which and how many observations have missing data

In R, missing values are coded as NA (not available)
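
The code behind the output below is not shown. A minimal sketch that would print the same three results, assuming the example vector is called vec1 (the name is inferred from the matrix output further down):

```r
# NA is a reserved constant that marks a missing value
NA

# assumed example vector with two missing entries
vec1 <- c(10, 20, NA, 40, 50, NA)
vec1

# missing values still count towards the length
length(vec1)
```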

[1] NA
[1] 10 20 NA 40 50 NA
[1] 6

Identify missing values with is.na()
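
The four results below correspond to four common questions: which values are missing, is anything missing at all, where are the missing values, and how many are there. A sketch of calls that would answer them, assuming the same vec1 as before:

```r
vec1 <- c(10, 20, NA, 40, 50, NA)   # assumed example vector

is.na(vec1)          # TRUE wherever a value is missing
any(is.na(vec1))     # is at least one value missing?
which(is.na(vec1))   # positions of the missing values
sum(is.na(vec1))     # number of missing values
```

A comparison such as vec1 == NA does not work for this purpose, because any comparison with NA is itself NA.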

[1] FALSE FALSE  TRUE FALSE FALSE  TRUE
[1] TRUE
[1] 3 6
[1] 2

Identify missing data in matrices
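
is.na() works element-wise on matrices and keeps the dimensions and row names. A sketch that would reproduce the two matrices below; vec1 and vec2 are taken from the row names in the output, while matVec is a hypothetical name:

```r
vec1   <- c(10, 20, NA, 40, 50, NA)
vec2   <- c(NA,  7,  9, 10,  1,  8)
matVec <- rbind(vec1, vec2)   # one row per vector, row names from the variable names
matVec

is.na(matVec)                 # logical matrix of the same shape
```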

     [,1] [,2] [,3] [,4] [,5] [,6]
vec1   10   20   NA   40   50   NA
vec2   NA    7    9   10    1    8
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
vec1 FALSE FALSE  TRUE FALSE FALSE  TRUE
vec2  TRUE FALSE FALSE FALSE FALSE FALSE

Behavior of NA in different situations

Missing data in index vectors
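
An NA in an index vector produces an NA at the corresponding position of the result. A one-line sketch that gives the output below, using the built-in LETTERS constant:

```r
# the second index is missing, so the second result is NA
LETTERS[c(1, NA, 3)]
```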

[1] "A" NA  "C"

Missing data in factors
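
By default, factor() does not treat NA as a level; passing exclude=NULL keeps it as a level of its own. A sketch consistent with the two factors printed below (letNA is a hypothetical name):

```r
letNA <- LETTERS[c(1, NA, 3)]   # character vector with one missing value

factor(letNA)                   # NA is excluded from the levels
factor(letNA, exclude=NULL)     # NA becomes a separate level
```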

[1] A    <NA> C   
Levels: A C
[1] A    <NA> C   
Levels: A C <NA>

Missing data in logical expressions
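
Logical operations return NA only when the result actually depends on the missing value. The sketch below would print the five results that follow; the data in vecNA are an assumption chosen to match both this output and the statistics shown later, and logIdx is a hypothetical name:

```r
NA & TRUE    # unknown: the result depends on the missing value
TRUE | NA    # TRUE regardless of the missing value

vecNA  <- c(-3, 2, 0, NA, -7, 5)   # assumed example data
logIdx <- vecNA > 0                # a comparison with NA gives NA
logIdx

vecNA[logIdx]          # an NA in a logical index returns NA
vecNA[which(logIdx)]   # which() keeps only the TRUE positions
```

Wrapping the logical index in which() is a common way to avoid stray NA entries in the selection.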

[1] NA
[1] TRUE
[1] FALSE  TRUE FALSE    NA FALSE  TRUE
[1]  2 NA  5
[1] 2 5

Code missing values as NA

When data is entered in other applications (spreadsheets, SPSS, etc.), missing values are often coded as a reserved numeric value, e.g., 99 or 9999. These values need to be replaced with NA.

In vectors
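
A sketch of the recoding step, assuming the raw vector used -999 and 999 as missing-value codes (the same codes that appear in the matrix example below):

```r
vec <- c(30, 25, 23, 21, -999, 999)   # assumed raw data with coded missings
vec[vec %in% c(-999, 999)] <- NA      # replace every coded entry with NA
vec
```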

[1] 30 25 23 21 NA NA

In matrices
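
The same logical indexing works on matrices, because %in% tests every element and the assignment keeps the dimensions. A sketch that would print the matrix before and after recoding, as shown below:

```r
mat <- matrix(c(30, 25, 23, 21, -999, 999), nrow=2, ncol=3)   # filled column by column
mat

mat[mat %in% c(-999, 999)] <- NA   # replace both codes in one step
mat
```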

     [,1] [,2] [,3]
[1,]   30   23 -999
[2,]   25   21  999
     [,1] [,2] [,3]
[1,]   30   23   NA
[2,]   25   21   NA

Statistical analysis with missing data

In vectors
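
Most summary functions return NA as soon as one value is missing, unless missing values are removed first. A sketch that gives the four results below, assuming the vecNA data from the logical-expressions example; the exact calls in the original are not known, so na.rm=TRUE and na.omit() are shown as two equivalent routes:

```r
vecNA <- c(-3, 2, 0, NA, -7, 5)   # assumed example data

mean(vecNA)               # a single NA makes the result NA
mean(vecNA, na.rm=TRUE)   # remove missing values before averaging
sd(na.omit(vecNA))        # na.omit() removes them explicitly
sum(vecNA, na.rm=TRUE)
```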

[1] NA
[1] -0.6
[1] 4.615192
[1] -3

In matrices
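
For matrices, extra arguments such as na.rm=TRUE can be passed through apply() to the summary function. A sketch that rebuilds the matrix below from its columns and computes the row means with and without missing values (matNA is a hypothetical name; the column names come from the output):

```r
ageNA <- c(18, NA, 27, 22)
DV1   <- c(NA,  1,  5, -3)
DV2   <- c( 9,  4,  2,  7)
matNA <- cbind(ageNA, DV1, DV2)
matNA

apply(matNA, 1, mean)               # NA for every row with a missing value
apply(matNA, 1, mean, na.rm=TRUE)   # means of the available values per row
```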

     ageNA DV1 DV2
[1,]    18  NA   9
[2,]    NA   1   4
[3,]    27   5   2
[4,]    22  -3   7
[1]        NA        NA 11.333333  8.666667
[1] 13.500000  2.500000 11.333333  8.666667

Casewise deletion of missing data
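
Casewise (listwise) deletion drops every row that contains at least one NA. The sketch below would produce the sequence of results that follows; rowNAidx is a hypothetical name, and the matrix is the one from the previous section:

```r
matNA <- cbind(ageNA=c(18, NA, 27, 22),
               DV1  =c(NA,  1,  5, -3),
               DV2  =c( 9,  4,  2,  7))

rowNAidx <- apply(is.na(matNA), 1, any)   # rows with at least one NA
rowNAidx
matNA[!rowNAidx, ]                         # keep complete rows by hand
na.omit(matNA)                             # same rows, plus a record of the dropped ones

colMeans(na.omit(matNA))
cov(na.omit(matNA))

# cov() can also do the casewise deletion itself
all.equal(cov(matNA, use="complete.obs"), cov(na.omit(matNA)))
```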

[1]  TRUE  TRUE FALSE FALSE
     ageNA DV1 DV2
[1,]    27   5   2
[2,]    22  -3   7
     ageNA DV1 DV2
[1,]    27   5   2
[2,]    22  -3   7
attr(,"na.action")
[1] 2 1
attr(,"class")
[1] "omit"
ageNA   DV1   DV2 
 24.5   1.0   4.5 
      ageNA DV1   DV2
ageNA  12.5  20 -12.5
DV1    20.0  32 -20.0
DV2   -12.5 -20  12.5
[1] TRUE

Set casewise deletion as the session-wide default for modeling functions such as lm() through the na.action option (other choices include "na.fail" and "na.exclude")
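
A sketch of how this option can be set and checked. The setting lasts for the current session and is read by modeling functions such as lm() and glm(); functions like mean() or cov() are controlled by their own na.rm or use arguments instead:

```r
options(na.action="na.omit")   # drop incomplete cases when fitting models
getOption("na.action")         # check the current setting
```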

Pairwise deletion of missing data
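
Pairwise deletion uses, for each statistic, all observations that are available for the variables involved. A sketch consistent with the output below; the second result (26.5 and 23.0) matches the row means of the recoded 2x3 matrix from the section on coding missing values, so the sketch assumes that matrix under the name mat:

```r
matNA <- cbind(ageNA=c(18, NA, 27, 22),
               DV1  =c(NA,  1,  5, -3),
               DV2  =c( 9,  4,  2,  7))
mat   <- matrix(c(30, 25, 23, 21, NA, NA), nrow=2, ncol=3)   # assumed recoded matrix

rowMeans(matNA)                            # NA for rows with missing values
rowMeans(mat, na.rm=TRUE)                  # means of the available values per row
cov(matNA, use="pairwise.complete.obs")    # each covariance from its own complete pairs
```

Because each entry is based on a different subset of rows, a pairwise covariance matrix is not guaranteed to be positive semi-definite.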

[1]        NA        NA 11.333333  8.666667
[1] 26.5 23.0
          ageNA DV1        DV2
ageNA  20.33333  20 -16.000000
DV1    20.00000  16 -10.000000
DV2   -16.00000 -10   9.666667

Further resources

Useful packages

Multiple imputation is supported by functions in the packages mice and Amelia.
