This case study concerns Cyclistic, a fictional bike-share company in Chicago, and aims to answer a business question. The company operates more than 5,800 geotracked bicycles of several types, for different purposes, locked into a network of about 600 docking stations. Its users are more likely to ride for leisure, but around 30% use the bikes to commute to work each day.
Until now, Cyclistic's marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make this possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders; customers who purchase annual memberships are Cyclistic members.
Cyclistic's finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno, the director of marketing and your manager, believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
1. A clear statement of the business task:
- The goal of Cyclistic (a bike-share company in Chicago) is to understand how casual riders and annual members use the bike-sharing service differently, in order to design effective marketing strategies that convert casual riders into members.
2. A description of all data sources used
The case study uses Cyclistic's historical trip data (Cyclistic is a fictional name used for the purposes of exploring this dataset)
- to analyze and identify trends in how users ride the bikes. This is public data that you can use to explore how different customer types are using Cyclistic bikes. Note that data-privacy issues prohibit using riders' personally identifiable information. This means that you won't be able to connect pass purchases to credit card numbers to determine whether casual riders live in the Cyclistic service area or have purchased multiple single passes.
1. Primary Stakeholders:
- Lily Moreno, Director of Marketing
- Cyclistic Executive Team
2. Secondary Stakeholders:
- Marketing Analytics Team
The datasets used in this case-study analysis cover the years 2019-2020:
divvy-trips-2020-q2q4:
202004-divvy-tripdata.csv
202005-divvy-tripdata.csv
202006-divvy-tripdata.csv
202007-divvy-tripdata.csv
202008-divvy-tripdata.csv
202009-divvy-tripdata.csv
202010-divvy-tripdata.csv
202011-divvy-tripdata.csv
202012-divvy-tripdata.csv
divvy-trips-2019-q1q4
Divvy_Trips_2019_Q1.csv
Divvy_Trips_2019_Q2.csv
Divvy_Trips_2019_Q3.csv
Divvy_Trips_2019_Q4.csv
Divvy_Trips_2020_Q1.csv
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load
library(tidyverse) # metapackage of all tidyverse packages
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
list.files(path = "../input")
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create
# a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
# Loading the required packages
library(tidyverse)
library(conflicted)
conflict_prefer("filter", "dplyr")
conflict_prefer("lag", "dplyr")
- Install the required packages and load the related libraries (already installed in the Kaggle Notebook editor)
- Upload and read the datasets
- Inspect the data for integrity (accuracy, consistency, completeness, validity, uniqueness, timeliness)
After uploading and reading the datasets, check the entire dataset for inconsistencies,
inaccuracies, and anomalies.
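As a minimal sketch of what such integrity checks can look like (using a toy tibble rather than the real trip data, so the values here are purely illustrative):

```r
library(tidyverse)

# Toy data standing in for a trip dataset (hypothetical values)
trips <- tibble(
  ride_id       = c("a1", "a2", "a2", "a4"),
  started_at    = as.POSIXct(c("2020-04-01 08:00", "2020-04-01 09:00",
                               "2020-04-01 09:00", NA), tz = "UTC"),
  member_casual = c("member", "casual", "casual", "member")
)

# Completeness: count NAs per column
na_counts <- colSums(is.na(trips))

# Uniqueness: duplicated ride ids signal duplicate records
dup_ids <- sum(duplicated(trips$ride_id))

na_counts
dup_ids
```

The same pattern (`colSums(is.na(...))`, `duplicated()`, plus `str()`/`summary()` for types and ranges) is applied to the real datasets in the steps below.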
dt2019q1 <- read_csv('/kaggle/input/divvy-trips-2019-q1q4/Divvy_Trips_2019_Q1.csv')
dt2019q2 <- read_csv('/kaggle/input/divvy-trips-2019-q1q4/Divvy_Trips_2019_Q2.csv')
dt2019q3 <- read_csv('/kaggle/input/divvy-trips-2019-q1q4/Divvy_Trips_2019_Q3.csv')
dt2019q4 <- read_csv('/kaggle/input/divvy-trips-2019-q1q4/Divvy_Trips_2019_Q4.csv')
dt2020q1 <- read_csv('/kaggle/input/divvy-trips-2019-q1q4/Divvy_Trips_2020_Q1.csv')
dt202004 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202004-divvy-tripdata.csv')
dt202005 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202005-divvy-tripdata.csv')
dt202006 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202006-divvy-tripdata.csv')
dt202007 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202007-divvy-tripdata.csv')
dt202008 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202008-divvy-tripdata.csv')
dt202009 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202009-divvy-tripdata.csv')
dt202010 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202010-divvy-tripdata.csv')
dt202011 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202011-divvy-tripdata.csv')
dt202012 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202012-divvy-tripdata.csv')
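As an alternative to fourteen separate `read_csv()` calls, the monthly files could be read in one pass with purrr. A sketch using temporary toy files (the directory and file contents here are illustrative, not the Kaggle paths):

```r
library(tidyverse)

# Create two tiny CSVs in a temp dir to stand in for the monthly files
dir <- tempdir()
write_csv(tibble(ride_id = "x1", member_casual = "member"),
          file.path(dir, "202004-divvy-tripdata.csv"))
write_csv(tibble(ride_id = "x2", member_casual = "casual"),
          file.path(dir, "202005-divvy-tripdata.csv"))

# Read every matching file and stack the rows into one tibble
monthly <- list.files(dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE) %>%
  map_dfr(read_csv, show_col_types = FALSE)

nrow(monthly)
```

Note that this shortcut only works once column names and types agree across the files, which is exactly why the renaming and type standardisation below are needed first.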
- While reading each dataset, we noticed:
* There are inconsistencies in (dt2019q2)'s column names that need to be aligned with the other datasets' column names.
* We need to rename all columns of the first four datasets (dt2019q1-q4) to match those of the rest of the datasets
(dt2020q1, dt202004-12), e.g. "ride_id" and "rideable_type".
* We must standardise the data types of "trip_id" and "bikeid" (now renamed to "ride_id" and "rideable_type") from numeric (num) to character (chr).
- The next step is to combine all the datasets into one large dataset to make it ready for analysis and
visualisation, after inspecting and cleaning it to remove further inconsistencies, inaccuracies, and anomalies
from the entire dataset. To do this we apply the following steps, each with its own details.
# While checking the column names, we found inconsistencies in column names:
colnames(dt2019q1)
colnames(dt2019q2)
colnames(dt2019q3)
colnames(dt2019q4)
colnames(dt2020q1)
colnames(dt202004)
colnames(dt202005)
colnames(dt202006)
colnames(dt202007)
colnames(dt202008)
colnames(dt202009)
colnames(dt202010)
colnames(dt202011)
colnames(dt202012)
## Changing the column names from upper case to lower case
dt2019q2 <- janitor::clean_names(dt2019q2)
colnames(dt2019q2)
# renaming the column names of the first four datasets (dt2019q1 - q4) to align it with the rest
dt2019q1<- rename(dt2019q1,
ride_id = trip_id,
rideable_type = bikeid,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype )
dt2019q2<- rename(dt2019q2,
ride_id = x01_rental_details_rental_id,
rideable_type = x01_rental_details_bike_id,
started_at = x01_rental_details_local_start_time,
ended_at = x01_rental_details_local_end_time,
start_station_name = x03_rental_start_station_name,
start_station_id = x03_rental_start_station_id,
end_station_name = x02_rental_end_station_name,
end_station_id = x02_rental_end_station_id,
member_casual = user_type,
gender= member_gender,
birthyear = x05_member_details_member_birthday_year,
tripduration = x01_rental_details_duration_in_seconds_uncapped)
dt2019q3<- rename(dt2019q3,
ride_id = trip_id,
rideable_type = bikeid,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype )
dt2019q4<- rename(dt2019q4,
ride_id = trip_id,
rideable_type = bikeid,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype )
# checking if columns are renamed correctly
colnames(dt2019q1)
colnames(dt2019q2)
colnames(dt2019q3)
colnames(dt2019q4)
colnames(dt2020q1)
# It is time to make the "ride_id" and "rideable_type" data type of the datasets (dt2019q1-q4) aligned with the rest of the data sets.
# Standardise the data type of "trip_id" and "bikeid" now changed to "ride_id" and "rideable_type" from (num) to (chr).
dt2019q1 <- mutate(dt2019q1, ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type))
dt2019q2 <- mutate(dt2019q2, ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type))
dt2019q3 <- mutate(dt2019q3, ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type))
dt2019q4 <- mutate(dt2019q4, ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type))
tibble(dim(dt2019q1),
dim(dt2019q2),
dim(dt2019q3),
dim(dt2019q4),
dim(dt2020q1),
dim(dt202004),
dim(dt202005),
dim(dt202006),
dim(dt202007),
dim(dt202008),
dim(dt202009),
dim(dt202010),
dim(dt202011),
dim(dt202012))
# If applied str() to all data sets, we'll notice that (dt202012)'s "start_station_id" and "end_station_id"
## are character (chr) by type which also should be changed from (chr) to (num) so to be aligned with the rest.
#str(dt2019q1)
#str(dt2019q2)
#str(dt2019q3)
#str(dt2019q4)
#str(dt2020q1)
#str(dt202004)
#str(dt202005)
#str(dt202006)
#str(dt202007)
#str(dt202008)
#str(dt202009)
#str(dt202010)
#str(dt202011)
str(dt202012)
# changing the station id variables from (chr) type to (num) type.
dt202012 <- dt202012 %>%
mutate(start_station_id = as.numeric(start_station_id), # non-numeric ids are coerced to NA with a warning
end_station_id = as.numeric(end_station_id))
# Combining all the datasets into one large dataset, making it ready for further exploration.
all_trips <- bind_rows(dt2019q1, dt2019q2, dt2019q3, dt2019q4,
dt2020q1, dt202004,
dt202005,dt202006,dt202007,dt202008,
dt202009, dt202010, dt202011, dt202012)
In this step we perform some **manipulation** and **cleaning** to make the data ready for the analysis step.
Here, we try to:
* handle missing values
* remove duplicates
* trim whitespace
* resolve inconsistencies
* create new variables and omit irrelevant values
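The duplicate-removal and whitespace-trimming items can be sketched as follows (on toy data; in the real dataset the natural key for `distinct()` would be `ride_id`):

```r
library(tidyverse)

# Toy data with a duplicated ride and stray whitespace (hypothetical values)
trips <- tibble(
  ride_id            = c("r1", "r1", "r2"),
  start_station_name = c(" Clark St & Elm St", "Clark St & Elm St ", "State St")
)

cleaned <- trips %>%
  mutate(start_station_name = str_trim(start_station_name)) %>%  # trim whitespace
  distinct(ride_id, .keep_all = TRUE)                            # drop duplicate rides

nrow(cleaned)
```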
# Let's look for missing or NA values.
# After applying dim() and sum(), we observe an enormous number of null values in the entire dataset, which is
# the sum of the null values of each column in the dataset.
dim(all_trips)
sum(is.na(all_trips))
# total number of missing values in each column
as_tibble(t(colSums(is.na(all_trips[, 1:8])))) # t(): transpose the table from vertical to horizontal form
as_tibble(t(colSums(is.na(all_trips[, 9:16]))))
cat("\n\n")
# Here we cover two tasks:
# first, we omit all null values from station name/id, making the entire dataset smaller;
# then we remove some variables that are unusable or irrelevant to this analysis and visualisation.
all_trips_v1 <- all_trips %>%
filter(!is.na(start_station_id),
!is.na(start_station_name),
!is.na(end_station_id),
!is.na(end_station_name),
)%>%
select(-c(tripduration, gender, birthyear, start_lat, start_lng, end_lat, end_lng))
# With each change to the entire dataset, the main dataset name (all_trips_vN, v for version) advances to the next version number.
# In the form of a tibble, we check whether there are still null values in each column of the dataset.
as_tibble(t(colSums(is.na(all_trips_v1)))) # t(): transpose the table from vertical to horizontal form
cat("\n\n")
# Let's also check whether the columns we wanted removed were omitted correctly.
head(all_trips_v1)
# First:
# we find how many unique (distinct) values there are for each variable.
# From the returned results, we observe that the pattern we wanted (one name per id) does not hold; the name and id counts differ.
# From this inequality in n_distinct between the variables, we infer that some unique station_ids have more than one unique station_name.
n_distinct(all_trips_v1$start_station_name)
n_distinct(all_trips_v1$end_station_name)
n_distinct(all_trips_v1$start_station_id)
n_distinct(all_trips_v1$end_station_id)
# Second:
# Let's check in practice which unique values of "start_station_id" have more than one unique value of "start_station_name" in the dataset.
all_trips_station1 <- all_trips_v1 %>%
group_by(start_station_id, start_station_name) %>%
summarise(count_station = n(), .groups = "drop") %>%
group_by(start_station_id)%>%
filter(n_distinct(start_station_name) > 1, start_station_id > 1) %>%
arrange(start_station_id)
all_trips_station2 <- all_trips_v1 %>%
group_by(end_station_id, end_station_name) %>%
summarise(count_station = n(), .groups = "drop") %>%
group_by(end_station_id)%>%
filter(n_distinct(end_station_name) > 1, end_station_id > 1) %>%
arrange(end_station_id)
# The returned result from our inspection confirms this and lists 44 station_ids with more than one station_name, which makes 89 rows in total.
dim(all_trips_station1)
n_distinct(all_trips_station1$start_station_id)
cat("\n\n")
head(all_trips_station1, 3)
tail(all_trips_station1, 3)
cat("\n\n")
head(all_trips_station2, 3)
tail(all_trips_station2, 3)
# The returned result above shows there are some station names with extra marks "(*)" and the repeated word "(Temp)";
# in most names these seem irrelevant. Let's omit those extras.
all_trips_v2 <- all_trips_v1 %>%
mutate(end_station_name = str_replace_all(end_station_name, "\\(\\*\\)|\\(Temp\\)", ""),
start_station_name = str_replace_all(start_station_name, "\\(\\*\\)|\\(Temp\\)", ""),
end_station_name = trimws(end_station_name),
start_station_name = trimws(start_station_name))
# While running the following code to check whether the unique values satisfy station_id = station_name,
## the returned tibble shows there are still some names that need to be renamed
## so that each unique station name corresponds to one unique station id.
all_trips_station3 <- all_trips_v2 %>%
group_by(start_station_id, start_station_name) %>%
summarise(station_count = n(), .groups = "drop") %>%
group_by(start_station_id) %>%
filter(n_distinct(start_station_name) > 1, start_station_id > 1) %>%
arrange(start_station_id)
all_trips_station4 <- all_trips_v2 %>%
group_by(end_station_id, end_station_name) %>%
summarise(station_count = n(), .groups = "drop") %>%
group_by(end_station_id) %>%
filter(n_distinct(end_station_name) > 1, end_station_id > 1) %>%
arrange(end_station_id)
# The number of station_ids with more than one station_name is reduced from 44 to 12 unique ids, and the total number of rows is now 25.
dim(all_trips_station3)
n_distinct(all_trips_station3$start_station_id)
head(all_trips_station3)
tail(all_trips_station3)
cat("\n\n")
head(all_trips_station4)
tail(all_trips_station4)
# To get rid of such inconsistencies between unique station ids and station names, let's rename those values in the station_name columns.
# Since there are two different unique station names for one unique station_id, we replace the low-frequency
# station name with the high-frequency one.
# Next we also remove leftover spaces in the names.
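The hand-written recode() below was built by eyeballing the frequency tables. The same "keep the high-frequency name per id" rule could also be derived programmatically; a sketch on toy data (the station names here mirror the real ones but the counts are invented):

```r
library(tidyverse)

stations <- tibble(
  start_station_id   = c(5, 5, 5, 9),
  start_station_name = c("Loomis St & Taylor St", "Loomis St & Taylor St",
                         "Throop St & Taylor St", "State St & 29th St")
)

# For each id, pick the most frequent name as the canonical one
canonical <- stations %>%
  count(start_station_id, start_station_name) %>%
  group_by(start_station_id) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(start_station_id, canonical_name = start_station_name)

# Overwrite every name with its id's canonical name
fixed <- stations %>%
  left_join(canonical, by = "start_station_id") %>%
  mutate(start_station_name = canonical_name) %>%
  select(-canonical_name)

n_distinct(fixed$start_station_name)
```

A programmatic rule like this scales better than a manual lookup, though the manual recode() keeps full control over edge cases.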
all_trips_station_cleaned <- all_trips_v2 %>%
mutate(
end_station_name = recode(end_station_name,
"Throop (Loomis) St & Taylor St" = "Loomis St & Taylor St",
"Throop St & Taylor St" = "Loomis St & Taylor St",
"Racine Ave (May St) & Fulton St" = "Elizabeth (May) St & Fulton St",
"Ashland Ave & 21st St" = "Laflin St & Cullerton St",
"Halsted St & 104th St" = "Western Ave & 104th St",
"Franklin St & Adams St" = "Franklin St & Quincy St",
"McClurg Ct & Illinois St" = "New St & Illinois St",
"Laflin St & Cullerton St" = "Ashland Ave & 21st St",
"Burling St (Halsted) & Diversey Pkwy"= "Burling St & Diversey Pkwy",
"Drake Ave & Fullerton Ave" = "St. Louis Ave & Fullerton Ave",
"Ashland Ave & 73rd St" = "Ashland Ave & 74th St",
"Jeffery Blvd & 91st St" = "Avenue O & 134th St",
"Michigan Ave & 114th St" = "Halsted St & 104th St"),
start_station_name = recode(start_station_name,
"Throop (Loomis) St & Taylor St" = "Loomis St & Taylor St",
"Throop St & Taylor St" = "Loomis St & Taylor St",
"Racine Ave (May St) & Fulton St" = "Elizabeth (May) St & Fulton St",
"Ashland Ave & 21st St" = "Laflin St & Cullerton St",
"Halsted St & 104th St" = "Western Ave & 104th St",
"Franklin St & Adams St" = "Franklin St & Quincy St",
"McClurg Ct & Illinois St" = "New St & Illinois St",
"Laflin St & Cullerton St" = "Ashland Ave & 21st St",
"Burling St (Halsted) & Diversey Pkwy"= "Burling St & Diversey Pkwy",
"Drake Ave & Fullerton Ave" = "St. Louis Ave & Fullerton Ave",
"Ashland Ave & 73rd St" = "Ashland Ave & 74th St",
"Jeffery Blvd & 91st St" = "Avenue O & 134th St",
"Michigan Ave & 114th St" = "Halsted St & 104th St",
"Dodge Ave & Main St" = "Chicago Ave & Dempster St"),
end_station_name = trimws(end_station_name),
start_station_name = trimws(start_station_name)
)
# A change has been made to the entire dataset, so we rename the main dataset to the next version.
all_trips_v3 <- all_trips_station_cleaned
# After cleaning the station names, the results below show that two names in start/end_station_name remain uncleaned.
# We apply the previous renaming approach to get rid of them too.
all_trips_station5 <- all_trips_v3 %>%
group_by(end_station_id, end_station_name) %>%
summarise(station_count = n(), .groups = "drop") %>%
group_by(end_station_id) %>%
filter(n_distinct(end_station_name) > 1, end_station_id > 1) %>%
arrange(end_station_id)
head(all_trips_station5)
tail(all_trips_station5)
all_trips_station6 <- all_trips_v3 %>%
group_by(start_station_id, start_station_name) %>%
summarise(station_count = n(), .groups = "drop") %>%
group_by(start_station_id) %>%
filter(n_distinct(start_station_name) > 1, start_station_id >1) %>%
arrange(start_station_id)
head(all_trips_station6)
tail(all_trips_station6)
# Here we rename the remaining values in start/end_station_name.
all_trips_station_cleaned2 <- all_trips_v3 %>%
mutate(
end_station_name = recode(end_station_name,
"Ashland Ave & 21st St" = "Laflin St & Cullerton St",
"Halsted St & 104th St" = "Western Ave & 104th St"
),
start_station_name = recode(start_station_name,
"Ashland Ave & 21st St" = "Laflin St & Cullerton St",
"Halsted St & 104th St" = "Western Ave & 104th St"
),
end_station_name = trimws(end_station_name),
start_station_name = trimws(start_station_name))
# We reassign the newly created dataset to the main dataset name to keep the main dataset names consistent,
# changing only the version number.
all_trips_v4 <- all_trips_station_cleaned2
# Now, we check and notice that the number of unique start/end_station_name values is approximately equal to the number of unique start/end_station_id values.
tibble(start_station_id = n_distinct(all_trips_v4$start_station_id),
start_station_name = n_distinct(all_trips_v4$start_station_name),
end_station_id = n_distinct(all_trips_v4$end_station_id),
end_station_name = n_distinct(all_trips_v4$end_station_name))
# Let's further inspect whether there are any NAs or empty strings "" in start/end_station_id and start/end_station_name, and then correct them.
all_trips_v4 %>% filter(is.na(start_station_name)|
is.na(start_station_id) |
is.na(end_station_name) |
is.na(end_station_id)) %>%
select(start_station_id, start_station_name, end_station_id, end_station_name)
all_trips_v4 %>% filter(start_station_name == "" |
start_station_id == "" |
end_station_name == "" |
end_station_id == "") %>%
select( start_station_id, start_station_name, end_station_id, end_station_name)
# From the empty tibbles we see there are no NAs or "".
# Let's look at the member_casual and rideable_type variable categories:
# In the member_casual column, there are 4 categories, which should be just "member" and "casual".
# The rideable_type column consists of many ride types indicated by codes; we don't break these categories down further in this analysis.
unique(all_trips_v4$member_casual)
head(unique(all_trips_v4$rideable_type), 20)
all_trips_v4 <- all_trips_v4 %>%
mutate(member_casual = ifelse(member_casual == "Subscriber", "member",
ifelse(member_casual == "Customer", "casual", member_casual)))
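An equivalent, arguably more readable form of the nested ifelse() recoding uses dplyr::case_when(); a small self-contained sketch (the vector here is illustrative):

```r
library(dplyr)

# Toy labels mixing the 2019 and 2020 naming conventions
x <- c("Subscriber", "Customer", "member", "casual")

recoded <- case_when(
  x == "Subscriber" ~ "member",
  x == "Customer"   ~ "casual",
  TRUE              ~ x   # leave already-clean labels untouched
)

unique(recoded)
```

case_when() reads top to bottom and stops at the first matching condition, which avoids the deep nesting that ifelse() chains accumulate.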
In this step, we perform manipulation and aggregation on the data to organize the parts needed for analysis, through particular tibbles and visualisations. This will make it easy to discover trends and relationships in both members' and casuals' number of rides and average ride time. We will also come to understand how these two customer types differ in their use of the service by year, month, week, and hour. The aim is to answer the business question.
* We add some relevant new variables/columns.
* We then format the new variables with the proper types.
# Let's manipulate the started_at column and generate new columns (date, year, month, day, day_of_week, minutes).
all_trips_v4$date <- as.Date(all_trips_v4$started_at)
all_trips_v4$year <- format(as.Date(all_trips_v4$started_at), "%Y")
all_trips_v4$month <- format(as.Date(all_trips_v4$started_at), "%m") # "%m" is month; "%M" would be minutes
all_trips_v4$day <- format(as.Date(all_trips_v4$started_at), "%d") # "%d" is day of month; "%D" would be mm/dd/yy
all_trips_v4$day_of_week <- format(as.Date(all_trips_v4$started_at), "%A")
all_trips_v4$minutes <- format(as.POSIXct(all_trips_v4$started_at), "%M")
# We also add a ride_duration/ride_length column: it shows how long a member or a casual spent
# on each ride, in seconds (units are set explicitly so the later division by 60 reliably yields minutes).
all_trips_v4$ride_length <- difftime(all_trips_v4$ended_at, all_trips_v4$started_at, units = "secs")
# We check the dimensions and column names of the dataset to see whether the mentioned columns were generated.
dim(all_trips_v4)
colnames(all_trips_v4)
cat("\n\n")
# Apply the str() function to observe any anomalies in the structure of the dataset.
# After looking through the columns and their structures, we notice that ride_length is not in the proper format and should be converted to the numeric (num) data type.
str(all_trips_v4)
# We notice that "ride_length" is not in the proper format, so we convert it to numeric.
all_trips_v4$ride_length <- as.numeric(as.character(all_trips_v4$ride_length))
# Let's confirm that the conversion was done correctly
summary(all_trips_v4$ride_length)
cat("\n\n")
str(all_trips_v4$ride_length)
# Though the data type conversion was done correctly, there are numerous negative values in the ride_length column.
# After investigating further, we found that those negative values (at station "HQ QR") relate to rides made for quality and maintenance checks.
all_trips_negValue0 <- all_trips_v4 %>%
select(start_station_name, ride_length) %>%
filter(ride_length < 0, start_station_name == "HQ QR")
all_trips_negValue1 <- all_trips_v4 %>%
filter(ride_length < 0, start_station_name == "HQ QR") %>%
group_by(start_station_name) %>%
summarise(count_negride_length = sum(ride_length <0, na.rm = TRUE),
total_ride_length = n())
dim(all_trips_negValue0)
head(all_trips_negValue0)
all_trips_negValue1
# Now, remove the irrelevant negative values.
all_trips_v5 <- all_trips_v4[
!(all_trips_v4$start_station_name == "HQ QR" | all_trips_v4$ride_length < 0),]
all_trips_negValue1 <- all_trips_v5 %>%
select(start_station_name, ride_length) %>%
filter(ride_length < 0)
# Let's check whether those values were omitted
all_trips_negValue1
head(all_trips_v5, 3)
By performing the analysis and visualisation we convey the message (insights) gained from the analysis
to the stakeholders.
* We form tibbles from the variables required for particular visualisations.
* We apply visualisation to those (summary) tibbles to illustrate the differences through trends
and relationships.
# Dataset aggregation on the basis of year: in the resulting tibble you see:
## the number of rides done each year
## statistics related to rides in each year
all_trips_aggregation <- all_trips_v5 %>%
group_by(member_casual, year) %>%
summarise(ride_count = n(),
mean_ride= round(mean(ride_length/60, na.rm = TRUE), 2),
median_ride = round(median(ride_length/60, na.rm = TRUE), 2),
min_ride = round(min(ride_length/60, na.rm = TRUE), 2),
max_ride = round(max(ride_length/60, na.rm = TRUE), 2), .groups = "drop") %>%
mutate(percentage = round(ride_count/ sum(ride_count)*100, 2))
all_trips_aggregation
# Counting the number of rides done by members and casuals across both years, and the percentage of each.
all_trips_year <- all_trips_v5 %>%
group_by(member_casual) %>%
summarise(ride_count = n(), .groups = "drop") %>%
mutate(percentage = round(ride_count/ sum(ride_count)*100, 2))
all_trips_year
# We load libraries for richer visualisation
library(ggplot2)
library(scales) # "scales" is loaded to customise the numbers on the y axis
library(gridExtra) # "gridExtra" is loaded to customise the plot layout
library(grid)
grid.newpage()
options(repr.plot.width = 12, repr.plot.height = 5)
# In the plots below:
## "geom_bar" is applied to visualise the number of rides members and casuals made each year
## "coord_polar" is applied to display a pie chart showing the number of rides as percentages
ride_countyear <- ggplot(data = all_trips_aggregation) +
geom_bar(mapping = aes(x = member_casual, y = ride_count, fill = year),
stat = "identity", position = "dodge") +
labs(
title = "Ride trips by year",
x = "member_casual",
y = "ride_count"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
ride_countyrptg1 <- ggplot(data = all_trips_year,
mapping = aes(x = "", y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity", width = 0.5) +
coord_polar("y")+
geom_text(aes(label = paste0(percentage, "%")),
position = position_stack(vjust = 0.5),
color = "yellow")+
labs(title = "Number of rides in percentage")+
theme_void()
grid.arrange( ride_countyear, ride_countyrptg1, ncol = 2)
The bar chart displays the number of rides, and the pie chart shows these numbers as percentages, for the years 2019 and 2020.
- The bar chart reflects increased ride trips in 2020 compared with 2019 for both rider types.
* What is more important is the shift in the number of rides by casuals, which roughly tripled compared with
the previous year and grew by about a third more than members' rides in 2020.
* This means members did not ride much more than casuals in 2020 (the opposite of the previous year), though
members' rides still nearly doubled compared with the previous year.
- The pie chart shows the percentage of rides by members and casuals over both years together.
* The number of rides done by members (69.77%) is considerably higher than by casuals (30.23%).
# The dataset is aggregated by month and year, displaying members' and casuals' riding-time behaviour and differences.
# In the returned tibble you can see:
## the mean ride time for each month, across the four seasons of the year.
## the number of rides / ride trips done each month.
month_meanrides <- all_trips_v5 %>% mutate(month = month(started_at, label = TRUE)) %>%
group_by(member_casual, year, month) %>%
summarise(mean_ride = mean(ride_length/60, na.rm = TRUE),
ride_count = n(), .groups = "drop")
dim(month_meanrides)
head(month_meanrides, 3)
tail(month_meanrides, 3)
options(repr.plot.width = 9, repr.plot.height = 5)
month_meanride <- ggplot(data = month_meanrides) +
geom_bar(mapping = aes(x = month, y = mean_ride, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid( member_casual ~ year ) +
labs(
title = "Average ride time spent each month by members and casuals",
x = "month",
y = "mean_ride"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
month_countride <- ggplot(data = month_meanrides) +
geom_bar(mapping = aes(x = month, y = ride_count, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid( member_casual ~ year ) +
labs(
title = "Number of rides done each month by members and casuals",
x = "month",
y = "ride_count"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
month_meanride
month_countride
These two plots are generated from the average ride time (mean_ride) and the number of rides (ride_count) for each month across 2019 and 2020, i.e. across the four seasons of the year.
Overall, the average ride length/time is nearly three times larger for casuals than for members.
- The first plot (in red, on top) shows stable average ride times for **casuals** in 2019,
* but with an extremely high average ride time in February, almost three times larger than in the rest of the months.
* These numbers changed the next year (2020), showing the highest average ride time at the beginning
of the year, in winter (January and February).
* Following this season, the number roughly halved in spring (March and April).
* It kept decreasing steadily until the end of the year, reaching its lowest point in December.
- By contrast, the average ride time **for members**, who spent less time on the bikes, was nearly constant in
2020. In particular, members' highest average ride length was in spring (April and May) and early summer
(June and July).
In general, the second plot illustrates a similar pattern of ride trips for both members and casuals, although there
are some differences in their ride counts over the year.
- The red plot illustrates ride trips (ride_count) for casuals in both years, with a relative increase in 2020.
* The period when casuals ride the most starts at the end of spring (May), reaches its highest ride numbers
at the end of summer (August), and declines until the end of fall (November).
* The lowest number of rides is in winter, and it is insignificant, meaning casuals are not willing to use the bikes then.
- By contrast, members made more rides than casuals in every month of the year.
* In 2019, ride numbers rose from mid-spring (April), peaked in August, and declined
until mid-fall (October).
* Despite a nearly identical pattern and relatively lower ride numbers in 2020, members experienced their
lowest ride trips in mid-spring (April) of 2020.
# The dataset is aggregated by weekday and year, displaying members' and casuals' riding patterns.
# In the resulting tibble you can see:
## the number of rides / ride trips done each day of the week in each year.
weekday_ride <- all_trips_v5 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, year, weekday) %>%
summarise(ride_count = n(), .groups = "drop")
head(weekday_ride, 3)
options(repr.plot.width = 12, repr.plot.height = 6)
# The dataset is aggregated by weekday and year, displaying members' and casuals' riding-time patterns.
# In the resulting tibble you can see:
## the mean ride time for each day of the week in each year.
## the number of rides / ride trips done each day.
weekday_meanride <- all_trips_v5 %>%
mutate(weekday = wday(started_at, label = TRUE))%>%
group_by(member_casual, year, weekday) %>%
summarise(mean_ride = round(mean(ride_length/60, na.rm = TRUE), 2),
ride_count = n(), .groups = "drop_last")
dim(weekday_meanride)
head(weekday_meanride)
# To better understand the riding patterns of members and casuals,
# it is suitable to consider mean_ride in the visualisation.
# In the visualisations below:
## the first plot displays the average (mean) ride time spent by members and casuals each day.
## the second plot displays the number of rides / ride trips / frequency of rides done each day.
options(repr.plot.width = 8, repr.plot.height = 4)
ggplot(data = weekday_meanride) +
geom_bar(mapping = aes(x = weekday, y = mean_ride, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid(member_casual ~ year)+
labs(
title = "Average ride time spent each day by members and casuals",
x = "weekdays",
y = "mean_ride"
) +
theme_minimal() +
scale_y_continuous(labels = label_number(accuracy = 0.01))
ggplot(data = weekday_meanride) +
geom_bar(mapping = aes(x = weekday, y = ride_count, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid(member_casual ~ year)+
labs(
title = "Number of rides made each day by members and casuals",
x = "Weekday",
y = "Number of rides"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
Both plots compare the mean ride time spent each day and the number of rides made by members and casuals.
- The first plot shows that casuals' mean ride time is almost four times longer than members'.
* In 2020 the mean ride time, which in 2019 was largest in the middle of the week, decreased and became largest on the weekends (Saturday and Sunday), with a minor reduction from the previous year.
- In contrast, the mean ride time remained stable for members in both years, with slightly larger mean rides on the weekends.
- In the second plot, the frequency of rides for casuals follows almost the same pattern in both years, with a relatively small increase in the second year.
* Accordingly, casuals ride most on the weekends and least on the other weekdays, which suggests they mostly use the bikes for leisure. This pattern is the complete opposite for members.
* Members mostly used the bikes on weekdays in both years rather than on weekends, presumably for jobs or tasks that needed to be done.
* Another point worth noting is that members' rides were concentrated in the middle of the week in 2019, then decreased in 2020 and became more evenly spread across the week.
* On top of everything, members made nearly three times as many ride trips as casuals.
# Aggregate by hour of day (00-23) to show each group's riding-time and bike-usage pattern.
# The resulting tibble contains:
## the mean ride time (in minutes) for each hour of the day.
## the number of rides (ride trips) made in each hour.
hours_meanride <- all_trips_v5 %>%
mutate( hour = format(as.POSIXct(started_at), "%H")) %>%
group_by(member_casual, year, hour) %>%
summarise(meanride = mean(ride_length/60, na.rm = TRUE),
ride_count = n(), .groups = "drop_last")
dim(hours_meanride)
head(hours_meanride, 3)
tail(hours_meanride, 3)
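The `hour` column above is extracted with `format(..., "%H")`, which yields a zero-padded character such as "07" rather than a number. A minimal standalone illustration (the timestamp is made up):

```r
# format() gives a zero-padded character hour, which sorts correctly on a
# discrete plot axis; as.integer() would convert it if numeric hours were needed.
started <- as.POSIXct("2020-06-15 07:45:00", tz = "UTC")
format(started, "%H")            # "07" (character)
as.integer(format(started, "%H"))  # 7
```

Keeping the hour as a character is a reasonable choice here, since ggplot2 then treats it as a discrete axis with one bar per hour.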
# Summary statistics for mean ride time (meanride), in minutes, by rider type.
min(hours_meanride$meanride[hours_meanride$member_casual == "member"])
max(hours_meanride$meanride[hours_meanride$member_casual == "member"])
mean(hours_meanride$meanride[hours_meanride$member_casual == "member"])
cat("\n")
min(hours_meanride$meanride[hours_meanride$member_casual == "casual"])
max(hours_meanride$meanride[hours_meanride$member_casual == "casual"])
mean(hours_meanride$meanride[hours_meanride$member_casual == "casual"])
# In the visualizations below:
## the first plot displays the average (mean) ride time spent by members and casuals each hour.
## the second plot displays the number of rides (ride trips) made each hour.
hour_mride <- ggplot(data = hours_meanride) +
geom_bar(mapping = aes(x = hour, y = meanride, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid(member_casual ~.) +
labs(
title = "Average ride time spent by members and casuals each hour",
x = "Hour of day",
y = "Mean ride time (min)"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
hour_cride <- ggplot(data = hours_meanride) +
geom_bar(mapping = aes(x = hour, y = ride_count, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid( member_casual ~.) +
labs(
title = "Number of rides made by members and casuals each hour",
x = "Hour of day",
y = "Number of rides"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
hour_mride
hour_cride
Both plots display the average ride time (mean_ride) and the number of rides (ride_count) made by members and casuals in each hour of the day.
The first plot visualizes the average ride time for both groups:
- Casual riders spend the most time per ride overnight, from 12am to 5am, with the longest average rides at 3am and then 4am; their average ride time is stable for the remaining hours.
- In contrast, members' average ride time is almost stable across all 24 hours, with a moderate increase overnight from 2am to 3am.
The second plot shows how many times members and casuals used the bikes in each hour of the day:
- The top bar chart counts the ride trips made by casuals in each hour.
* In particular, usage starts increasing from 6am, reaches its peak at 5pm, and then declines until 12am.
* This means most of the bike usage falls in the period from 12pm to 7pm.
- On the other hand, the number of trips made by members fluctuates more and is greater than casuals' in general.
* Their trip counts peak twice: first at 8am and again at 7pm.
* Their busiest riding period is from 6am to 10pm.
* Additionally, ridership is nearly stable within this period except for the peaks at 7am-8am and 4pm-6pm.
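The peak hours described above can also be read off programmatically rather than by eye. A small standalone sketch with toy hourly counts (the numbers are invented, not the real ride counts):

```r
# Toy hourly ride counts for hours "00".."09"; which.max() returns the
# position of the maximum, and names() recovers the hour label.
counts <- setNames(c(5, 3, 2, 2, 4, 8, 20, 45, 60, 30), sprintf("%02d", 0:9))
names(which.max(counts))  # "08"
```

Applied per group to `hours_meanride$ride_count`, this would confirm the 8am and 7pm peaks without reading them off the chart.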
# Generate two summary tibbles based on "member_casual" and "routes",
# with a count variable (count_stations), to find the routes most used by members and casuals.
stations_route1 <- all_trips_v5 %>%
filter(member_casual == "casual") %>%
mutate(routes = str_c(start_station_name, "->", end_station_name)) %>%
group_by(member_casual, routes) %>%
summarise(count_stations = n(), .groups = "drop") %>%
arrange(desc(count_stations)) %>%
slice_head(n = 20)
head(stations_route1)
stations_route2 <- all_trips_v5 %>%
filter(member_casual == "member") %>%
mutate(routes = str_c(start_station_name, "->", end_station_name)) %>%
group_by(member_casual, routes) %>%
summarise(count_stations = n(), .groups = "drop") %>%
arrange(desc(count_stations)) %>%
slice_head(n = 20)
dim(stations_route2)
head(stations_route2)
tail(stations_route2)
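One thing to watch with `str_c()`: if either station name is NA, the whole route becomes NA, so rides with a missing start or end station silently fall into a single NA route. A small standalone sketch (stringr only, toy vectors):

```r
library(stringr)

# str_c() propagates NA (unlike paste(), which would print "NA" as text).
start <- c("Clark St", NA,       "State St")
end   <- c("Lake St",  "Oak St", NA)
str_c(start, "->", end)
# "Clark St->Lake St" NA NA
```

Filtering on `!is.na(start_station_name) & !is.na(end_station_name)` before building routes would keep the counts honest if the cleaned data still contains missing stations.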
options(repr.plot.width = 10, repr.plot.height = 5)
# Creating a visualisation that display most used routes
ggplot(data = stations_route1, mapping = aes( x = reorder(routes, count_stations), y = count_stations,
fill = count_stations)) +
geom_bar(stat = "identity", position = "dodge", show.legend = TRUE)+
coord_flip()+
labs(
title = "Top 20 Most Frequent Routes for casual riders",
x = "Route",
y = "Number of Rides"
) +
theme_minimal()
cat("
Description:
Count_station_routes (casuals)
- This plot displays the top 20 routes used by casual riders, in order of frequency.
- Most of these routes appear to go to recreational locations.
\n")
ggplot(data = stations_route2, mapping = aes( x = reorder(routes, count_stations), y = count_stations,
fill = count_stations)) +
geom_bar(stat = "identity", position = "dodge", show.legend = TRUE)+
coord_flip() +
labs(
title = "Top 20 Most Frequent Routes for member riders",
x = "Route",
y = "Number of Rides"
) +
theme_minimal()
cat("
Description:
Count_station_routes (members)
- This plot displays the top 20 routes used by members, in order of frequency.
\n")
# Top 10 routes per rider type (20 rows in total) for a combined comparison plot.
stations_route_top20 <- all_trips_v5 %>%
mutate(routes = str_c(start_station_name, "->", end_station_name)) %>%
group_by(member_casual, year, routes) %>%
summarise(count_stations = n(), .groups = "drop") %>%
group_by(member_casual) %>%
slice_max(count_stations, n = 10) %>%
ungroup() %>%
arrange (desc(count_stations))
head(stations_route_top20)
options(repr.plot.width = 12, repr.plot.height = 5)
stations_routes_top20 <- ggplot(data = stations_route_top20, mapping = aes( x = reorder(routes, count_stations),
y = count_stations, fill = member_casual )) +
geom_bar(stat = "identity", position = "dodge", show.legend = TRUE)+
coord_flip() +
labs(
title = "Top 10 Most Frequent Routes per Rider Type (members vs. casuals)",
x = "Route",
y = "Number of Rides"
) +
theme_minimal()
stations_routes_top20
The objective of this case study is to answer the business question: how do members and casual riders use the bikes differently? Based on the trends and behaviors they showed over particular periods of time (year, month, weekday, and hour) and over particular routes, stakeholders will be able to develop an effective marketing strategy to convert casuals into members.
The bar chart displays the number of rides, and the pie chart shows these numbers as percentages, for the years 2019 and 2020.
- The bar chart reflects an increase in ride trips from 2019 to 2020 for both rider types.
* Most importantly, the number of rides made by casuals roughly tripled year over year, ending about a third higher than members' count in 2020.
* This means members no longer out-rode casuals in 2020 (the opposite of the previous year), although members' ride count still nearly doubled compared with the previous year.
- The pie chart shows the percentage of rides by rider type across both years combined.
* The share of rides made by members (69.77%) is considerably higher than that of casuals (30.23%).
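The member/casual shares quoted above can be recomputed directly from the ride counts. A minimal sketch with hypothetical totals standing in for the real counts from all_trips_v5:

```r
# Hypothetical ride counts chosen to reproduce the quoted shares;
# the real numbers come from counting rows of all_trips_v5 per rider type.
rides <- c(member = 6977, casual = 3023)
round(100 * rides / sum(rides), 2)
# member: 69.77, casual: 30.23
```

The same one-liner applied to `table(all_trips_v5$member_casual)` would verify the pie-chart percentages.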
Overall, the average ride time is nearly three times longer for casuals than for members.
Casuals, specifically, had a stable average ride length in 2019 (apart from an average about three times larger in February), which changed in 2020:
- Peak points: 89000m (January) and 7600m (February), in mid-winter
- Lowest points: 1250m (October) and 1200m (December), during fall and the beginning of winter
The average ride time for members was nearly constant in both years, but in 2020:
- Peak points: 1500m (April) to 1000m (May), in spring
- Lowest points: 400m (January) and 450m (December), in winter
Both casuals and members show an almost similar pattern in ride trips, with small differences:
- Most trips: May (90000t), end of spring, to November (70000t), end of fall.
- Peak point: August (2800000t)
- Lowest points: December (30000t) to February (20000t), in winter
By contrast, in 2020 members made more rides than casuals in every month of the year.
- Most trips: June (190000t, down from 350000t the previous year) to October (210000t, down from 400000t the previous year), mid-spring to mid-fall.
- Peak point: August (320000t, down from 400000t the previous year).
- Lowest points: April (50000t) and December (80000t) to February (20000t), in winter
Casuals spent almost four times longer on average than members across the weekdays in general.
Average ride time for casuals during the week:
- Peak points: 56m (Sunday) and 50m (Saturday), on weekends
- Lowest points: 42m (Tuesday and Wednesday), in the middle of the week
In contrast, the mean ride time (around 18m) remained stable for members in both years, with slightly larger mean rides on the weekends.
The frequency of rides for casuals followed almost the same pattern in both years, with a relatively small increase in the second year.
- Most trips: Saturday (300000t), Sunday (250000t), and Friday (195000t)
- This suggests casuals use the bikes more for leisure on the weekends.
In contrast, members mostly rode on weekdays: around 300000t in 2020, considerably reduced from around 460000t in 2019. This suggests they use the service mostly for necessary tasks, not leisure.
As a whole, casuals spent about three times longer per ride on average than members.
Casuals:
- High average points: 12am (100m) to 5am (100m), overnight
- Peak points: 3am and 4am (190m and 140m)
- Lowest points: 7am and 8am (30m and 50m), in the morning
Members:
- Members' average ride time is almost stable across all 24 hours (around 20m)
- but with a moderate increase overnight from 2am to 3am (around 25m)
Casuals:
- Most trips: starting at 7am and decreasing until 11pm
- Highest trips: 11am to 8pm (around 100000t)
- Peak point: 7pm (150000t)
- Lowest trips: 1am to 6am (15000t)
Members:
- Most trips: 6am to 9pm (around 140000t)
- Highest trips: first at 7am-8am (250000t - 290000t),
second at 4pm, 5pm, and 6pm (300000t - 390000t - 250000t)
- Lowest trips: 12am to 4am (15000t)
Casuals: the top 20 routes, ordered from most used. Most of these routes appear to go to recreational locations.
Based on these conclusions, it is suggested that the executives and the marketing department focus on the areas and stations where casuals ride most while using the service. They can also base their marketing campaign on the times of day casuals spend riding the bikes. The frequency of trips casuals make can also be influential if taken into account in the marketing strategy. Both longer ride times and high trip frequency can potentially increase the chances of converting casuals into members. All of this can be read from the specific time periods shown in the visualizations.