## Objectives

This is a college assignment, and the data can be found here – https://github.com/rizkashifs/kashif-general/blob/master/fitbit_activity_data.

The primary objective of this project is to refresh some basics, as follows:

1. Basic R Concepts

2. Reading A File / Writing A File

3. File Manipulation

4. Basic Statistical Functions

5. Presentation Of Report As Literate Program

## Introduction

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up.

This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consists of two months of data from an anonymous individual, collected during October and November 2012, and includes the number of steps taken in 5-minute intervals each day.

## Load Data

Firstly, we load the data set, and view the first 6 rows –

```
# Set Directory
setwd("~/Downloads")
# Load
data <- read.csv("activity.csv" , header = TRUE)
# Print Header
head (data)
```

|   | steps | date       | interval |
|---|-------|------------|----------|
| 1 | NA    | 2012-10-01 | 0        |
| 2 | NA    | 2012-10-01 | 5        |
| 3 | NA    | 2012-10-01 | 10       |
| 4 | NA    | 2012-10-01 | 15       |
| 5 | NA    | 2012-10-01 | 20       |
| 6 | NA    | 2012-10-01 | 25       |
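As a small, self-contained sketch of the read/write step: the example below uses a made-up three-row data frame (the name `toy` is hypothetical) and a temporary file rather than the real `activity.csv`. It writes a CSV and reads it back with explicit column types, so that `date` arrives as a proper `Date`.

```r
# Hypothetical toy data with the same three columns as the activity file
toy <- data.frame(
  steps    = c(NA, 12L, 0L),
  date     = c("2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0L, 5L, 10L)
)

# Write to a temporary file; row.names = FALSE avoids an extra index column
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Read it back, declaring the column types up front so that 'date'
# is parsed as a Date instead of a character vector or factor
toy_read <- read.csv(path, header = TRUE,
                     colClasses = c("integer", "Date", "integer"))
str(toy_read)
```

Declaring `colClasses` is optional; without it, `date` would need a later `as.Date()` conversion.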

The variables included in this dataset are:

1. steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)

2. date: The date on which the measurement was taken, in YYYY-MM-DD format

3. interval: Identifier for the 5-minute interval in which the measurement was taken

There are a total of 17,568 observations in this dataset.
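The observation count can be checked with a little arithmetic: 61 days (October plus November) times 288 five-minute intervals per day. The sketch below also rebuilds the interval identifiers under the assumption that they are coded as hour * 100 + minute; that coding is inferred from the maximum of 2355 reported by `summary()`, not stated in the assignment text here.

```r
# 24 hours * 60 minutes, split into 5-minute slots
intervals_per_day <- (24 * 60) / 5          # 288
days <- 31 + 30                             # October + November
total_rows <- days * intervals_per_day      # 17568

# Assumed HHMM-style identifiers: hour * 100 + minute
interval_ids <- as.vector(outer(seq(0, 55, by = 5), (0:23) * 100, "+"))
length(interval_ids)   # 288
range(interval_ids)    # 0 2355
mean(interval_ids)     # 1177.5
```

Under that assumption the mean of the identifiers is 1177.5, which agrees with the mean of `interval` in the summary output.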

## PART 1 – Missing Data Analysis

**1. Calculate and report the total number of records with missing values in the dataset (i.e. the total number of rows with NAs)**

This is most easily done with the summary() function, which reports the NA count for each column.

```
summary (data)
```

| steps           | date              | interval        |
|-----------------|-------------------|-----------------|
| Min. : 0.00     | 2012-10-01: 288   | Min. : 0.0      |
| 1st Qu.: 0.00   | 2012-10-02: 288   | 1st Qu.: 588.8  |
| Median : 0.00   | 2012-10-03: 288   | Median :1177.5  |
| Mean : 37.38    | 2012-10-04: 288   | Mean :1177.5    |
| 3rd Qu.: 12.00  | 2012-10-05: 288   | 3rd Qu.:1766.2  |
| Max. :806.00    | 2012-10-06: 288   | Max. :2355.0    |
| NA’s :2304      | (Other) :15840    |                 |

We see that there are 2,304 NAs in the steps column and none in the other two columns, so 2,304 rows contain missing values.
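The same count can be obtained directly rather than read off the summary. The following is a minimal sketch on a made-up four-row data frame (`toy` is a hypothetical stand-in for the real data):

```r
# Hypothetical toy data: two of the four rows have a missing step count
toy <- data.frame(
  steps    = c(NA, 4, NA, 10),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5)
)

# Number of NAs in each column
na_per_column <- colSums(is.na(toy))

# Number of rows with at least one NA in any column
rows_with_na <- sum(!complete.cases(toy))

na_per_column
rows_with_na
```

`complete.cases()` looks at every column, so it answers the "rows with NAs" question even when more than one column can be missing.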

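**2. Calculate and report the total number of records with missing values per day (i.e. the number of rows with NAs for each day)**

I’m still looking for easier alternative ways to calculate this. However, the aggregate function, used below, was made for exactly this kind of grouped summary. Setting na.action = NULL matters here: by default, the formula interface of aggregate drops rows containing NA before applying the function, so there would be nothing left to count.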

```
# We use aggregate function for this
calc_na <- aggregate(steps ~ date, data=data, function(x) {sum(is.na(x))}, na.action =NULL)
# This creates a data frame which is stored as calc_na. This contains all days with corresponding missing values
head (calc_na)
```

|   | date       | steps |
|---|------------|-------|
| 1 | 2012-10-01 | 288   |
| 2 | 2012-10-02 | 0     |
| 3 | 2012-10-03 | 0     |
| 4 | 2012-10-04 | 0     |
| 5 | 2012-10-05 | 0     |
| 6 | 2012-10-06 | 0     |
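One possible alternative for the per-day count is `tapply()`, sketched here on made-up data (`toy` is hypothetical). For comparison, the same counts are computed with `aggregate()` using `na.action = na.pass`, which keeps the NA rows so that they can be counted.

```r
# Hypothetical toy data: three days with two intervals each
toy <- data.frame(
  steps = c(NA, NA, 3, 7, NA, 1),
  date  = c("2012-10-01", "2012-10-01", "2012-10-02",
            "2012-10-02", "2012-10-03", "2012-10-03")
)

# tapply splits the logical vector is.na(steps) by date and sums each group
na_by_day <- tapply(is.na(toy$steps), toy$date, sum)

# The same counts via aggregate(); na.pass keeps the NA rows
na_by_day_df <- aggregate(steps ~ date, data = toy,
                          FUN = function(x) sum(is.na(x)),
                          na.action = na.pass)

na_by_day
na_by_day_df
```

`tapply()` returns a named vector, while `aggregate()` returns a data frame; which is more convenient depends on what happens to the result next.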

## PART 2 – Data Analysis

**1. What is the number of valid records across all days?**

Method 1 –

We calculated NA’s earlier. We can tweak the same command with ! (not) operator to get the opposite of NA, which are valid values.

```
# same function with not (for aggregation)
calc_rec <- aggregate(steps ~ date, data=data, function(x) {sum(!is.na(x))}, na.action =NULL)
# We can see the data frame
head (calc_rec)
```

Summing the per-day counts then gives the total:

```
sum (calc_rec$steps)
```

Method 2 –

However, the easiest way to do this is as follows:

```
sum (!is.na(data$steps))
```

**2. What is the total number of steps taken across all days?**

```
# sum function is used. na.rm = TRUE (hence NA values are ignored)
sum (data$steps, na.rm = TRUE)
```

**3. What is the average number of steps taken across all days?**

```
# mean function is used. na.rm = TRUE (hence NA values are ignored)
mean (data$steps, na.rm = TRUE)
```
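Note that `mean()` on the raw column averages over 5-minute intervals, not over days. The sketch below, on made-up numbers (`toy` is hypothetical), contrasts the per-interval average with a per-day average computed from daily totals.

```r
# Hypothetical toy data: three days, two intervals each; the last day is missing
toy <- data.frame(
  steps = c(0, 10, 20, 30, NA, NA),
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)
)

# Average per 5-minute interval (what mean() on the raw column gives)
mean_per_interval <- mean(toy$steps, na.rm = TRUE)   # (0 + 10 + 20 + 30) / 4

# Average per day: total each day first, then average over days with data
daily_totals <- tapply(toy$steps, toy$date, sum)     # 10, 50, NA
mean_per_day <- mean(daily_totals, na.rm = TRUE)     # (10 + 50) / 2

mean_per_interval
mean_per_day
```

Which of the two is the "average across all days" depends on how the question is read, so it is worth stating the unit alongside the number.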

**4. Which 5-minute interval, across all the days, contains the maximum number of steps?**

```
# which.max returns the row number of the first interval with the
# highest number of steps (NA values are ignored)
max_row <- which.max(data$steps)
# max(data$steps, na.rm = TRUE) would return the step count itself, not its row
# we subset with the row number to get the entire observation out of the data frame
data[ max_row , ]
```
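The difference between `max()` and `which.max()` is easy to trip over: the first returns a value, the second a position. A minimal sketch on made-up data (`toy` is hypothetical):

```r
# Hypothetical toy data
toy <- data.frame(
  steps    = c(NA, 3, 806, 0, 12),
  interval = c(0, 5, 10, 15, 20)
)

# max() returns the largest step count itself
largest_value <- max(toy$steps, na.rm = TRUE)

# which.max() returns the row position of that count (NAs are skipped)
largest_row <- which.max(toy$steps)

# Only the position is suitable for row subsetting
toy[largest_row, ]
```

Using the value as a row index would silently select an unrelated row (or a row of NAs if the value exceeds the number of rows).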

**5. Which 5-minute interval, across all the days, contains the minimum number of steps?**

The approach mirrors the previous question, with the minimum in place of the maximum. Since the median of steps is 0, many intervals tie at the minimum; which.min reports only the first of them.

```
# which.min returns the row number of the first interval with the
# lowest number of steps (NA values are ignored)
min_row <- which.min(data$steps)
# we subset with the row number to get the entire observation out of the data frame
data[ min_row , ]
```
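When the minimum is shared by many rows, it may be more informative to list all of them. A sketch on made-up data (`toy` is hypothetical):

```r
# Hypothetical toy data with a tie at the minimum
toy <- data.frame(
  steps    = c(NA, 0, 7, 0, 3),
  interval = c(0, 5, 10, 15, 20)
)

# which.min() reports only the first row that attains the minimum
first_min_row <- which.min(toy$steps)

# which() on a comparison reports every such row; NA comparisons are dropped
all_min_rows <- which(toy$steps == min(toy$steps, na.rm = TRUE))

toy[all_min_rows, ]
```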

**6. What is the number of valid records per day?**

```
# We use aggregate function for this
valid_day <- aggregate(steps ~ date, data=data, function(x) {sum(!is.na(x))}, na.action =NULL)
# This creates a data frame which is stored. This contains all days with corresponding count of valid records
head (valid_day)
```

**7. What is the total number of steps taken per day?**

```
sum_day <- aggregate(steps ~ date, data=data, sum, na.action =NULL)
# This creates a data frame containing each day with its total number of steps.
# Because the NA rows are kept and sum is called without na.rm = TRUE,
# any day with missing values gets a total of NA
head (sum_day)
```

**8. What is the average number of steps taken per day?**

```
avg_day <- aggregate(steps ~ date, data=data, mean, na.action =NULL)
# This creates a data frame containing each day with its mean number of
# steps per 5-minute interval (NA for days with missing values)
head (avg_day)
```

**9. On which day did the individual walk the most?**

```
# first we'll aggregate sum for each day as we did earlier
sum_day <- aggregate(steps ~ date, data=data, sum, na.action =NULL)
# inspect the first rows of the daily totals
head (sum_day)
```

Extract the day with the largest total number of steps:

```
# use which.max function for this
sum_day[ which.max(sum_day$steps) , ]
```

**10. On which day did the individual walk the least?**

We repeat similar steps for the minimum.

```
# first we'll aggregate sum for each day as we did earlier
sum_day <- aggregate(steps ~ date, data=data, sum, na.action =NULL)
# now we extract the row with the least value for steps
# using which.min () function
sum_day[ which.min(sum_day$steps) , ]
```

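## ‘plyr’ package

Instead of using the aggregate function, an easier alternative is to use the ddply() function from the plyr package. Here is an example:

```
require ('plyr')
# One row per date: total steps, mean steps per interval, number of missing values
d <- ddply (data , .(date), summarise,
            count = sum(steps, na.rm=TRUE),
            mean = mean(steps, na.rm=TRUE),
            missingCount = sum(is.na(steps)))
# fix() opens the result in an interactive editor for inspection
fix (d)
```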

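|   | date       | count | mean     | missingCount |
|---|------------|-------|----------|--------------|
| 1 | 2012-10-01 | 0     | NaN      | 288          |
| 2 | 2012-10-02 | 126   | 0.43750  | 0            |
| 3 | 2012-10-03 | 11352 | 39.41667 | 0            |
| 4 | 2012-10-04 | 12116 | 42.06944 | 0            |
| 5 | 2012-10-05 | 13294 | 46.15972 | 0            |
| 6 | 2012-10-06 | 15420 | 53.54167 | 0            |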
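The first row (a count of 0 and a mean of NaN for 2012-10-01) follows from how `sum()` and `mean()` behave once `na.rm = TRUE` has removed every value. The sketch below shows this on a made-up all-missing vector; `total_or_na` is a hypothetical helper, not part of plyr, that reports NA instead of a misleading 0.

```r
# A day on which every interval is missing
all_missing <- c(NA_real_, NA_real_, NA_real_)

empty_sum  <- sum(all_missing, na.rm = TRUE)    # sum of nothing is 0
empty_mean <- mean(all_missing, na.rm = TRUE)   # 0 / 0 is NaN
plain_sum  <- sum(all_missing)                  # NA when NAs are kept

# Hypothetical helper: NA for fully missing groups, otherwise the usual sum
total_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

total_or_na(all_missing)
total_or_na(c(1, NA, 2))
```

A total of 0 for a fully missing day is easy to mistake for "no steps taken", so keeping a missing-count column, as the example above does, or returning NA, avoids that confusion.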

## Data Imputation Strategy

First, we do some basic analysis while ignoring the NA (missing) values, and see what we observe.

**Do we see a trend according to the day of the week?**

```
# Convert the 'date' column to Date format
data$date <- as.Date(data$date , "%Y-%m-%d")
## Add day of the week as a column
data$day <- weekdays(data$date)
## Aggregate by day
sum_day <- aggregate(steps ~ day, data=data, sum, na.action = na.omit)
sum_day
# no particular trend is observed, Wednesday has highest number of steps.
```

|   | day       | steps |
|---|-----------|-------|
| 1 | Friday    | 86518 |
| 2 | Monday    | 69824 |
| 3 | Saturday  | 87748 |
| 4 | Sunday    | 85944 |
| 5 | Thursday  | 65702 |
| 6 | Tuesday   | 80546 |
| 7 | Wednesday | 94326 |

**What if we look at the mean steps for each day of the week?**

```
sum_day <- aggregate(steps ~ day, data=data, mean, na.action = na.omit)
sum_day
```

|   | day       | steps    |
|---|-----------|----------|
| 1 | Friday    | 42.91567 |
| 2 | Monday    | 34.63492 |
| 3 | Saturday  | 43.52579 |
| 4 | Sunday    | 42.63095 |
| 5 | Thursday  | 28.51649 |
| 6 | Tuesday   | 31.07485 |
| 7 | Wednesday | 40.94010 |

Let's reorder the rows from Monday to Sunday.

```
reorder <- sum_day [ c(2,6,7,5,1,3,4),]
```
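Hard-coded row positions work, but they depend on the alphabetical order of the English day names. One alternative, sketched here with the means reported above, is to give the day column explicit factor levels and sort on it. The names `by_day`, `week_order`, and `ordered_days` are hypothetical, and English day names are assumed, since `weekdays()` is locale-dependent.

```r
# Day-of-week means, in the alphabetical order that aggregate() produced
by_day <- data.frame(
  day   = c("Friday", "Monday", "Saturday", "Sunday",
            "Thursday", "Tuesday", "Wednesday"),
  steps = c(42.91567, 34.63492, 43.52579, 42.63095,
            28.51649, 31.07485, 40.94010)
)

# Desired calendar order
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# A factor with explicit levels sorts by level order, not alphabetically
by_day$day <- factor(by_day$day, levels = week_order)
ordered_days <- by_day[order(by_day$day), ]
ordered_days
```

This gives the same row order as indexing with `c(2,6,7,5,1,3,4)`, without having to work out the positions by hand.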

Now we can plot the trend of average steps walked on each day of the week.

```
# Function to plot a time series plot.ts()
plot.ts (reorder$steps,xaxt = "n", xlab = "Day of the Week")
# Rename the labels for x-axis
axis(1, at=1:7, labels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
```

The trend observed is that the average number of steps per 5-minute interval is highest on Friday, Saturday, and Sunday, at around 43. The weekly dips are on Tuesday and Thursday, with Thursday being the lowest at about 28.5.

## Data Imputation

The simplest way to do data analysis is to ignore the missing values completely. However, this is quite naive: we do not know what the missing values would have shown, and they could be an important part of the trend or of other insights.

There are many approaches to data imputation; the simplest is to replace each missing value with a mean.

In our case, we could use the mean of that particular day, or the mean of that particular day of the week. For example, if we observe that the average on Fridays is about 43 steps per interval and there is a missing value on a Friday, we can impute it with that average. The same-day mean is not always available in this data set: 2012-10-01, for example, has all 288 intervals missing, so its own mean is NaN. The day-of-the-week mean is therefore used here.

```
## Impute with group means
## Basically, we compute the mean for each day of the week; if a value is missing,
## we replace it with the mean of that particular day of the week
# First convert day variable to a factor
data$day <- as.factor(data$day)
# now impute by group means
data$steps[is.na(data$steps)] <- ave(data$steps, data$day, FUN=function(x)mean(x, na.rm = TRUE))[is.na(data$steps)]
# print head
head(data)
```

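|   | steps    | date       | interval | day    |
|---|----------|------------|----------|--------|
| 1 | 34.63492 | 2012-10-01 | 0        | Monday |
| 2 | 34.63492 | 2012-10-01 | 5        | Monday |
| 3 | 34.63492 | 2012-10-01 | 10       | Monday |
| 4 | 34.63492 | 2012-10-01 | 15       | Monday |
| 5 | 34.63492 | 2012-10-01 | 20       | Monday |
| 6 | 34.63492 | 2012-10-01 | 25       | Monday |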

Hence, the steps with NA have been updated.

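Let’s repeat the day-of-the-week trend and see if imputation makes any difference to the analysis.

Imputing missing values with the mean of the corresponding day of the week shows no change in the trend in this case. This is expected: replacing a group's missing values with that group's own mean leaves the group mean unchanged.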
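Why the day-of-the-week means do not move can be checked on a small made-up example (`toy` and `group_mean` are hypothetical names): filling a group's gaps with the group's own mean leaves that mean where it was.

```r
# Hypothetical toy data: one missing value in each of two groups
toy <- data.frame(
  steps = c(10, NA, 30, 4, 6, NA),
  day   = c("Monday", "Monday", "Monday", "Friday", "Friday", "Friday")
)

group_mean <- function(x) mean(x, na.rm = TRUE)

# Group means before imputation (NAs ignored)
before <- tapply(toy$steps, toy$day, group_mean)

# Impute each missing value with the mean of its own group
missing <- is.na(toy$steps)
toy$steps[missing] <- ave(toy$steps, toy$day, FUN = group_mean)[missing]

# Group means after imputation
after <- tapply(toy$steps, toy$day, group_mean)

all.equal(before, after)
```

The means are preserved, but the spread within each group shrinks, because every imputed value sits exactly at the centre of its group.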