Activity Analysis – Fitbit Data Set

Objectives

This is a college assignment, and the data can be found here: https://github.com/rizkashifs/kashif-general/blob/master/fitbit_activity_data.

The primary objective of this project is to refresh some basics, as follows:

1. Basic R Concepts

2. Reading A File / Writing A File

3. File Manipulation

4. Basic Statistical Functions

5. Presentation Of Report As Literate Program

Introduction

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up.
This assignment makes use of data from a personal activity monitoring device that collects data at 5-minute intervals throughout the day. The data set covers two months of data from an anonymous individual, collected during October and November 2012, and includes the number of steps taken in each 5-minute interval of each day.

Load Data

First, we load the data set and view the first 6 rows:


# Set Directory
setwd("~/Downloads")

# Load
data <- read.csv("activity.csv", header = TRUE)

# Print Header
head(data)

 

##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25
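
Since reading and writing files is one of the objectives, here is a minimal round-trip sketch. It does not use the real activity.csv; the data frame toy, its values, and the temporary file path are made up purely for illustration.

```r
# Toy stand-in for the activity data (values are made up for illustration)
toy <- data.frame(
  steps    = c(NA, 5, 12),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02"),
  interval = c(0, 5, 0)
)

# Write the data frame to a temporary CSV file, without row names
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read it back; stringsAsFactors = FALSE keeps the dates as plain text
toy_read <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Inspect the column types and the first rows
str(toy_read)
head(toy_read)
```

The missing value survives the round trip: write.csv() writes it as NA and read.csv() reads it back as a missing value.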

 

The variables included in this dataset are:

1. steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)

2. date: The date on which the measurement was taken, in YYYY-MM-DD format

3. interval: Identifier for the 5-minute interval in which the measurement was taken

There are a total of 17,568 observations in this dataset.

PART 1 – Missing Data Analysis

1. Calculate and report the total number of records with missing values in the dataset (i.e. the total number of rows with NAs)

This is most easily done with the summary() function.


summary(data)

##      steps                date          interval
##  Min.   :  0.00   2012-10-01:  288   Min.   :   0.0
##  1st Qu.:  0.00   2012-10-02:  288   1st Qu.: 588.8
##  Median :  0.00   2012-10-03:  288   Median :1177.5
##  Mean   : 37.38   2012-10-04:  288   Mean   :1177.5
##  3rd Qu.: 12.00   2012-10-05:  288   3rd Qu.:1766.2
##  Max.   :806.00   2012-10-06:  288   Max.   :2355.0
##  NA's   :2304     (Other)   :15840

 

We see that there are 2304 NA’s in the steps column!
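
As a cross-check on the summary() output, the missing values can also be counted directly. The sketch below is only an illustration on a small made-up data frame (toy is hypothetical and stands in for the real data):

```r
# Toy stand-in for the activity data (values are made up for illustration)
toy <- data.frame(
  steps    = c(NA, NA, 0, 12, NA, 7),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02",
               "2012-10-02", "2012-10-03", "2012-10-03"),
  interval = c(0, 5, 0, 5, 0, 5)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows that contain at least one NA in any column
rows_with_na <- sum(!complete.cases(toy))
rows_with_na
```

On the real data, the same two calls applied to data should agree with the NA count that summary() reports for the steps column.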

 

2. Calculate and report the total number of records with missing values per day (i.e. the number of rows with NAs for each day)

I’m still looking for simpler alternatives, but the aggregate() function shown below was made for exactly this kind of task.

# We use the aggregate function for this.
# na.action = NULL keeps the NA rows so that is.na() can count them
# (the formula interface drops them by default).

calc_na <- aggregate(steps ~ date, data = data, function(x) sum(is.na(x)), na.action = NULL)

# calc_na is a data frame listing each day with its number of missing values

head(calc_na)

##         date steps
## 1 2012-10-01   288
## 2 2012-10-02     0
## 3 2012-10-03     0
## 4 2012-10-04     0
## 5 2012-10-05     0
## 6 2012-10-06     0
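
One possible alternative for the per-day count is tapply(), which avoids the formula interface altogether. This is a sketch on made-up data (the toy data frame is hypothetical), not a replacement for the approach above:

```r
# Toy stand-in for the activity data (values are made up for illustration)
toy <- data.frame(
  steps = c(NA, NA, 0, 12, NA, 7),
  date  = c("2012-10-01", "2012-10-01", "2012-10-02",
            "2012-10-02", "2012-10-03", "2012-10-03")
)

# is.na() flags the missing values; tapply() sums the flags within each date
na_per_day <- tapply(is.na(toy$steps), toy$date, sum)
na_per_day
```

The result is a named vector with one entry per date rather than a data frame, which is convenient for quick lookups such as na_per_day["2012-10-01"].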

 

PART 2 – Data Analysis

 

1. What is the number of valid records across all days?

 

Method 1 –

We calculated the NAs earlier. We can tweak the same command with the ! (not) operator to count the opposite of NA, i.e. the valid values.

# same function with not (for aggregation)
calc_rec <- aggregate(steps ~ date, data=data, function(x) {sum(!is.na(x))}, na.action =NULL)

# We can see the data frame

head (calc_rec)

Summing the per-day counts then gives the total:


sum (calc_rec$steps)

 

Method 2 –

However, the easiest way to do this is as follows:


sum (!is.na(data$steps))

 

2. What is the total number of steps taken across all days?


# sum function is used. na.rm = TRUE (hence NA values are ignored)
sum (data$steps, na.rm = TRUE)

 

 

3. What is the average number of steps taken across all days?


# mean function is used. na.rm = TRUE (hence NA values are ignored)
mean (data$steps, na.rm = TRUE)

 

4. Which 5-minute interval, across all the days, contains the maximum number of steps?


# max() would return the largest step count itself, not its position,
# so we use which.max(), which returns the row number of the largest value (NAs are skipped)
max_row <- which.max(data$steps)

# we subset with this row number to get the entire observation out of the data frame

data[max_row, ]

 

 

5. Which 5-minute interval, across all the days, contains the minimum number of steps?

The steps are the same as for the maximum, except that we look for the smallest value.


# which.min() returns the row number of the smallest value (NAs are skipped).
# If several intervals tie for the minimum, only the first one is returned.
min_row <- which.min(data$steps)

# we subset with this row number to get the entire observation out of the data frame

data[min_row, ]
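
The difference between an extreme value and its position is easy to mix up, so here is a minimal sketch on a made-up vector (steps_toy is hypothetical). It also shows one way to find every tied minimum, which matters here because many intervals have zero steps.

```r
# Made-up step counts, including a missing value and a tied minimum
steps_toy <- c(NA, 3, 0, 806, 0, 12)

# The largest value itself
max_value <- max(steps_toy, na.rm = TRUE)   # 806

# The position (row number) of the largest value; NAs are skipped
max_pos <- which.max(steps_toy)             # 4

# which.min() reports only the FIRST position of the minimum
first_min_pos <- which.min(steps_toy)       # 3

# which() with a comparison returns the positions of all tied minima
all_min_pos <- which(steps_toy == min(steps_toy, na.rm = TRUE))  # 3 5
```

Indexing a data frame with max_value instead of max_pos would silently pick row 806 rather than the row holding the maximum, which is why the position is what we want for subsetting.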

 

 

6. What is the number of valid records per day?

# We use aggregate function for this 

valid_day <- aggregate(steps ~ date, data=data, function(x) {sum(!is.na(x))}, na.action =NULL)

# This creates a data frame which is stored. This contains all days with corresponding count of valid records

head (valid_day)

 

7. What is the total number of steps taken per day?


sum_day <- aggregate(steps ~ date, data=data, sum, na.action =NULL)

# This creates a data frame of all days with their total number of steps.
# Because the NA rows are kept, a day with missing values has a total of NA.

head (sum_day)

 

 

8. What is the average number of steps taken per day?

avg_day <- aggregate(steps ~ date, data=data, mean, na.action =NULL)

# This creates a data frame of all days with their mean number of steps per interval.
# Because the NA rows are kept, a day with missing values has a mean of NA.

head (avg_day)

 

9. On which day did the individual walk the most?

# first we'll aggregate the sum for each day as we did earlier
sum_day <- aggregate(steps ~ date, data=data, sum, na.action =NULL)

# view the daily totals

head (sum_day)

Now extract the row with the largest aggregate of steps:


# use which.max function for this
sum_day[ which.max(sum_day$steps) , ]

 

 

10. On which day did the individual walk the least?

We repeat similar steps for min.

# first we'll aggregate sum for each day as we did earlier
sum_day <- aggregate(steps ~ date, data=data, sum, na.action =NULL)

# now we extract the row with least value for step
# using which.min () function
sum_day[ which.min(sum_day$steps) , ]

 

‘Plyr’ package

Instead of using the aggregate function, an easier alternative is the ddply() function from the plyr package. Here is an example:

require('plyr')
d <- ddply(data, .(date), summarise,
           count        = sum(steps, na.rm = TRUE),
           mean         = mean(steps, na.rm = TRUE),
           missingCount = sum(is.na(steps)))


fix(d)

        date count     mean missingCount
1 2012-10-01     0      NaN          288
2 2012-10-02   126  0.43750            0
3 2012-10-03 11352 39.41667            0
4 2012-10-04 12116 42.06944            0
5 2012-10-05 13294 46.15972            0
6 2012-10-06 15420 53.54167            0

 

Data Imputation Strategy

First, we do some basic analysis while ignoring the NA (missing) values, and see what we observe.

 

Do we see a trend according to the day of the week?

# Convert the 'date' column to Date format
data$date <- as.Date(data$date , "%Y-%m-%d")


## Add day of the week as a column
data$day <- weekdays(data$date)


## Aggregate by day
sum_day <- aggregate(steps ~ day, data=data, sum, na.action = na.omit)

sum_day
# no particular trend is observed, Wednesday has highest number of steps. 

        day steps
1    Friday 86518
2    Monday 69824
3  Saturday 87748
4    Sunday 85944
5  Thursday 65702
6   Tuesday 80546
7 Wednesday 94326

 

What if we look at the mean steps for each day of the week?

sum_day <- aggregate(steps ~ day, data=data, mean, na.action = na.omit)

sum_day

        day    steps
1    Friday 42.91567
2    Monday 34.63492
3  Saturday 43.52579
4    Sunday 42.63095
5  Thursday 28.51649
6   Tuesday 31.07485
7 Wednesday 40.94010

 

Let’s reorder the rows from Monday to Sunday.

reorder <- sum_day [ c(2,6,7,5,1,3,4),]
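
Reordering by hard-coded row positions works, but it depends on the alphabetical order of the rows. A possible alternative, sketched here under assumptions, is to turn the day names into a factor with explicit levels and sort on that. The data frame day_means and the vector week_order are hypothetical names, and the values are the means above rounded to one decimal.

```r
# Day-of-week means in alphabetical order (rounded from the output above)
day_means <- data.frame(
  day   = c("Friday", "Monday", "Saturday", "Sunday",
            "Thursday", "Tuesday", "Wednesday"),
  steps = c(42.9, 34.6, 43.5, 42.6, 28.5, 31.1, 40.9)
)

# The order we want the days to appear in
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# A factor with explicit levels sorts by level order, not alphabetically
day_means$day <- factor(day_means$day, levels = week_order)
ordered_means <- day_means[order(day_means$day), ]
ordered_means
```

Note that weekdays() returns names in the current locale, so week_order would need to match the locale in use.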

Now we can plot the trend of average steps walked on each day of the week.

 

# Function to plot a time series plot.ts()
plot.ts (reorder$steps,xaxt = "n", xlab = "Day of the Week")

# Rename the labels for x-axis
axis(1, at=1:7, labels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

weekday.png

The trend observed is that the average number of steps per 5-minute interval on the weekend, including Friday, is around 43. The weekly dips are on Tuesday and Thursday, with Thursday being the lowest.

 

Data Imputation

The simplest way to do data analysis is to ignore the missing values completely. However, this is quite naive: we do not know what the missing values would have shown, and they could be an important part of the trend or of other insights.

There are many approaches to data imputation, the simplest being to replace each missing value with a mean.

In our case, we could use the mean of that particular day, or the mean of that particular day of the week. For example, since the average on Fridays is about 43 steps per interval, a missing Friday value can be imputed with that average.

## Impute with group means
## Basically, we compute the mean for each day of the week; if a value is missing,
## we replace it with the mean of that particular day of the week

# First convert the day variable to a factor
data$day <- as.factor(data$day)

# ave() returns, for every row, the mean of that row's day-of-week group;
# we copy those means into the rows where steps is NA
data$steps[is.na(data$steps)] <- ave(data$steps,
                                     data$day,
                                     FUN = function(x) mean(x, na.rm = TRUE))[is.na(data$steps)]

# print head
head(data)

     steps       date interval    day
1 34.63492 2012-10-01        0 Monday
2 34.63492 2012-10-01        5 Monday
3 34.63492 2012-10-01       10 Monday
4 34.63492 2012-10-01       15 Monday
5 34.63492 2012-10-01       20 Monday
6 34.63492 2012-10-01       25 Monday

Hence, the steps with NA have been updated.

 

Let’s repeat the daily trend and see if it makes any difference in the analysis.

 

weekday2.png

In this case, imputing the missing values with the weekday mean shows no particular change in the trend.
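
One likely reason the trend does not move is that filling a group's missing values with that group's own mean leaves the group mean exactly where it was. The sketch below checks this on a made-up data frame (toy and its values are hypothetical), using the same ave() approach as above:

```r
# Toy data: two day-of-week groups, each with one missing value
toy <- data.frame(
  steps = c(10, NA, 30, 4, NA, 8),
  day   = c("Monday", "Monday", "Monday", "Friday", "Friday", "Friday")
)

# Group means before imputation, ignoring the NAs
means_before <- tapply(toy$steps, toy$day, mean, na.rm = TRUE)

# Impute each NA with the mean of its own group
group_mean <- ave(toy$steps, toy$day, FUN = function(x) mean(x, na.rm = TRUE))
missing <- is.na(toy$steps)
toy$steps[missing] <- group_mean[missing]

# Group means after imputation; no NAs remain, so na.rm is not needed
means_after <- tapply(toy$steps, toy$day, mean)

# The group means are unchanged by the imputation
all.equal(means_before, means_after)
```

The totals per group do change, since the imputed rows now contribute steps, so a plot of sums rather than means could look different.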

 

 
