This case study concerns Cyclistic, a fictional bike-share company in Chicago, and aims to answer a business question. The company operates more than 5,800 geotracked bicycles of several types, for different purposes, locked into a network of about 600 docking stations. Its users are more likely to ride for leisure, but around 30% use the bikes to commute to work each day.
Until now, Cyclistic's marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make this possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders; customers who purchase annual memberships are Cyclistic members.
Cyclistic's finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno, the director of marketing and your manager, believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
1. A clear statement of the business task:
- The goal of Cyclistic (a bike-share company in Chicago) is to understand how casual riders and annual members use the bike-sharing service differently, in order to design effective marketing strategies that convert casual riders into members.
2. A description of all data sources used
The case study uses Cyclistic's historical trip data (Cyclistic is a fictional name used for the purposes of exploring this dataset)
- to analyze and identify trends in how users ride the bikes. This is public data that you can use to explore how different customer types are using Cyclistic bikes. Note that data-privacy issues prohibit using riders' personally identifiable information. This means that you won't be able to connect pass purchases to credit card numbers to determine whether casual riders live in the Cyclistic service area or have purchased multiple single passes.
1. Primary Stakeholders:
- Lily Moreno, Director of Marketing
- Cyclistic Executive Team
2. Secondary Stakeholders:
- Marketing Analytics Team
The datasets used in this case-study analysis cover the years 2019-2020:
divvy-trips-2020-q2q4:
202004-divvy-tripdata.csv
202005-divvy-tripdata.csv
202006-divvy-tripdata.csv
202007-divvy-tripdata.csv
202008-divvy-tripdata.csv
202009-divvy-tripdata.csv
202010-divvy-tripdata.csv
202011-divvy-tripdata.csv
202012-divvy-tripdata.csv
divvy-trips-2019-q1q4
Divvy_Trips_2019_Q1.csv
Divvy_Trips_2019_Q2.csv
Divvy_Trips_2019_Q3.csv
Divvy_Trips_2019_Q4.csv
Divvy_Trips_2020_Q1.csv
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load
library(tidyverse) # metapackage of all tidyverse packages
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
list.files(path = "../input")
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create
# a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
# Loading the required packages
library(tidyverse)
library(conflicted)
conflict_prefer("filter", "dplyr")
conflict_prefer("lag", "dplyr")
- Install the required packages and load the related libraries (already installed in the Kaggle Notebook editor)
- Upload and read the datasets
- Inspect the data for integrity (accuracy, consistency, completeness, validity, uniqueness, timeliness)
After uploading and reading the datasets, check the entire dataset for inconsistencies,
inaccuracies, and anomalies.
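As a minimal sketch of what such integrity checks can look like (using a toy tibble rather than the real trip data, so the values here are purely illustrative):

```r
library(tidyverse)

# Toy data standing in for a trip dataset (hypothetical values)
trips <- tibble(
  ride_id       = c("a1", "a2", "a2", "a4"),
  started_at    = as.POSIXct(c("2020-04-01 08:00", "2020-04-01 09:00",
                               "2020-04-01 09:00", NA), tz = "UTC"),
  member_casual = c("member", "casual", "casual", "member")
)

# Completeness: count NAs per column
na_counts <- colSums(is.na(trips))

# Uniqueness: duplicated ride ids signal duplicate records
dup_ids <- sum(duplicated(trips$ride_id))

na_counts
dup_ids
```

The same pattern (`colSums(is.na(...))`, `duplicated()`, plus `str()`/`summary()` for types and ranges) is applied to the real datasets in the steps below.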
dt2019q1 <- read_csv('/kaggle/input/divvy-trips-2019-q1q4/Divvy_Trips_2019_Q1.csv')
dt2019q2 <- read_csv('/kaggle/input/divvy-trips-2019-q1q4/Divvy_Trips_2019_Q2.csv')
dt2019q3 <- read_csv('/kaggle/input/divvy-trips-2019-q1q4/Divvy_Trips_2019_Q3.csv')
dt2019q4 <- read_csv('/kaggle/input/divvy-trips-2019-q1q4/Divvy_Trips_2019_Q4.csv')
dt2020q1 <- read_csv('/kaggle/input/divvy-trips-2019-q1q4/Divvy_Trips_2020_Q1.csv')
dt202004 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202004-divvy-tripdata.csv')
dt202005 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202005-divvy-tripdata.csv')
dt202006 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202006-divvy-tripdata.csv')
dt202007 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202007-divvy-tripdata.csv')
dt202008 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202008-divvy-tripdata.csv')
dt202009 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202009-divvy-tripdata.csv')
dt202010 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202010-divvy-tripdata.csv')
dt202011 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202011-divvy-tripdata.csv')
dt202012 <- read_csv('/kaggle/input/divvy-trips-2020-q2q4/202012-divvy-tripdata.csv')
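As an alternative to fourteen separate `read_csv()` calls, the monthly files could be read in one pass with purrr. A sketch using temporary toy files (the directory and file contents here are illustrative, not the Kaggle paths):

```r
library(tidyverse)

# Create two tiny CSVs in a temp dir to stand in for the monthly files
dir <- tempdir()
write_csv(tibble(ride_id = "x1", member_casual = "member"),
          file.path(dir, "202004-divvy-tripdata.csv"))
write_csv(tibble(ride_id = "x2", member_casual = "casual"),
          file.path(dir, "202005-divvy-tripdata.csv"))

# Read every matching file and stack the rows into one tibble
monthly <- list.files(dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE) %>%
  map_dfr(read_csv, show_col_types = FALSE)

nrow(monthly)
```

Note that this shortcut only works once column names and types agree across the files, which is exactly why the renaming and type standardisation below are needed first.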
- While reading each dataset, we noticed:
* There are inconsistencies in (dt2019q2)'s column names that need to be aligned with the other datasets' column names.
* We need to rename all columns of the first four datasets (dt2019q1-q4) to match those of the rest of the datasets
(dt2020q1, dt202004-12), e.g. "ride_id" and "rideable_type".
* We must standardise the data types of "trip_id" and "bikeid" (now renamed to "ride_id" and "rideable_type") from numeric (num) to character (chr).
- The next step is to combine all the datasets into one large dataset to make it ready for analysis and
visualisation, after inspecting and cleaning it to remove further inconsistencies, inaccuracies, and anomalies
from the entire dataset. To do this we apply the following steps, each with its own details.
# While checking the column names, we found inconsistencies in column names:
colnames(dt2019q1)
colnames(dt2019q2)
colnames(dt2019q3)
colnames(dt2019q4)
colnames(dt2020q1)
colnames(dt202004)
colnames(dt202005)
colnames(dt202006)
colnames(dt202007)
colnames(dt202008)
colnames(dt202009)
colnames(dt202010)
colnames(dt202011)
colnames(dt202012)
## Changing the column names from upper case to lower case
dt2019q2 <- janitor::clean_names(dt2019q2)
colnames(dt2019q2)
# renaming the column names of the first four datasets (dt2019q1 - q4) to align it with the rest
dt2019q1<- rename(dt2019q1,
ride_id = trip_id,
rideable_type = bikeid,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype )
dt2019q2<- rename(dt2019q2,
ride_id = x01_rental_details_rental_id,
rideable_type = x01_rental_details_bike_id,
started_at = x01_rental_details_local_start_time,
ended_at = x01_rental_details_local_end_time,
start_station_name = x03_rental_start_station_name,
start_station_id = x03_rental_start_station_id,
end_station_name = x02_rental_end_station_name,
end_station_id = x02_rental_end_station_id,
member_casual = user_type,
gender= member_gender,
birthyear = x05_member_details_member_birthday_year,
tripduration = x01_rental_details_duration_in_seconds_uncapped)
dt2019q3<- rename(dt2019q3,
ride_id = trip_id,
rideable_type = bikeid,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype )
dt2019q4<- rename(dt2019q4,
ride_id = trip_id,
rideable_type = bikeid,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype )
# checking if columns are renamed correctly
colnames(dt2019q1)
colnames(dt2019q2)
colnames(dt2019q3)
colnames(dt2019q4)
colnames(dt2020q1)
# It is time to make the "ride_id" and "rideable_type" data type of the datasets (dt2019q1-q4) aligned with the rest of the data sets.
# Standardise the data type of "trip_id" and "bikeid" now changed to "ride_id" and "rideable_type" from (num) to (chr).
dt2019q1 <- mutate(dt2019q1, ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type))
dt2019q2 <- mutate(dt2019q2, ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type))
dt2019q3 <- mutate(dt2019q3, ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type))
dt2019q4 <- mutate(dt2019q4, ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type))
tibble(dim(dt2019q1),
dim(dt2019q2),
dim(dt2019q3),
dim(dt2019q4),
dim(dt2020q1),
dim(dt202004),
dim(dt202005),
dim(dt202006),
dim(dt202007),
dim(dt202008),
dim(dt202009),
dim(dt202010),
dim(dt202011),
dim(dt202012))
# If applied str() to all data sets, we'll notice that (dt202012)'s "start_station_id" and "end_station_id"
## are character (chr) by type which also should be changed from (chr) to (num) so to be aligned with the rest.
#str(dt2019q1)
#str(dt2019q2)
#str(dt2019q3)
#str(dt2019q4)
#str(dt2020q1)
#str(dt202004)
#str(dt202005)
#str(dt202006)
#str(dt202007)
#str(dt202008)
#str(dt202009)
#str(dt202010)
#str(dt202011)
str(dt202012)
# changing the station id variables from (chr) type to (num) type.
dt202012 <- dt202012 %>%
mutate(start_station_id = as.numeric(start_station_id), # non-numeric ids are coerced to NA with a warning
end_station_id = as.numeric(end_station_id))
# Combining all the datasets into one large dataset, making it ready for further exploration.
all_trips <- bind_rows(dt2019q1, dt2019q2, dt2019q3, dt2019q4,
dt2020q1, dt202004,
dt202005,dt202006,dt202007,dt202008,
dt202009, dt202010, dt202011, dt202012)
In this step we perform some **manipulation** and **cleaning** to make the data ready for the analysis step.
Here, we try to:
* handle missing values
* remove duplicates
* trim whitespace
* resolve inconsistencies
* create new variables and omit irrelevant values
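The duplicate-removal and whitespace-trimming items can be sketched as follows (on toy data; in the real dataset the natural key for `distinct()` would be `ride_id`):

```r
library(tidyverse)

# Toy data with a duplicated ride and stray whitespace (hypothetical values)
trips <- tibble(
  ride_id            = c("r1", "r1", "r2"),
  start_station_name = c(" Clark St & Elm St", "Clark St & Elm St ", "State St")
)

cleaned <- trips %>%
  mutate(start_station_name = str_trim(start_station_name)) %>%  # trim whitespace
  distinct(ride_id, .keep_all = TRUE)                            # drop duplicate rides

nrow(cleaned)
```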
# Let's look for missing or NA values.
# After applying dim() and sum(), we observe an enormous number of null values in the entire dataset, which is
# the sum of the null values of each column in the dataset.
dim(all_trips)
sum(is.na(all_trips))
# total number of missing values in each column
as_tibble(t(colSums(is.na(all_trips[, 1:8])))) # t(): transpose the table from vertical to horizontal form
as_tibble(t(colSums(is.na(all_trips[, 9:16]))))
cat("\n\n")
# Here we cover two tasks:
# first, we omit all null values from station name/id, making the entire dataset smaller;
# then we remove some variables that are unusable or irrelevant to this analysis and visualisation.
all_trips_v1 <- all_trips %>%
filter(!is.na(start_station_id),
!is.na(start_station_name),
!is.na(end_station_id),
!is.na(end_station_name),
)%>%
select(-c(tripduration, gender, birthyear, start_lat, start_lng, end_lat, end_lng))
# With each change to the entire dataset, the main dataset name (all_trips_vN, v for version) advances to the next version number.
# In the form of a tibble, we check whether there are still null values in each column of the dataset.
as_tibble(t(colSums(is.na(all_trips_v1)))) # t(): transpose the table from vertical to horizontal form
cat("\n\n")
# Let's also check whether the columns we wanted removed were omitted correctly.
head(all_trips_v1)
# First:
# we find how many unique (distinct) values there are for each variable.
# From the returned results, we observe that the pattern we wanted (one name per id) does not hold; the name and id counts differ.
# From this inequality in n_distinct between the variables, we infer that some unique station_ids have more than one unique station_name.
n_distinct(all_trips_v1$start_station_name)
n_distinct(all_trips_v1$end_station_name)
n_distinct(all_trips_v1$start_station_id)
n_distinct(all_trips_v1$end_station_id)
# Second:
# Let's check in practice which unique values of "start_station_id" have more than one unique value of "start_station_name" in the dataset.
all_trips_station1 <- all_trips_v1 %>%
group_by(start_station_id, start_station_name) %>%
summarise(count_station = n(), .groups = "drop") %>%
group_by(start_station_id)%>%
filter(n_distinct(start_station_name) > 1, start_station_id > 1) %>%
arrange(start_station_id)
all_trips_station2 <- all_trips_v1 %>%
group_by(end_station_id, end_station_name) %>%
summarise(count_station = n(), .groups = "drop") %>%
group_by(end_station_id)%>%
filter(n_distinct(end_station_name) > 1, end_station_id > 1) %>%
arrange(end_station_id)
# The returned result from our inspection confirms this and lists 44 station_ids with more than one station_name, which makes 89 rows in total.
dim(all_trips_station1)
n_distinct(all_trips_station1$start_station_id)
cat("\n\n")
head(all_trips_station1, 3)
tail(all_trips_station1, 3)
cat("\n\n")
head(all_trips_station2, 3)
tail(all_trips_station2, 3)
# The returned result above shows there are some station names with extra marks "(*)" and the repeated word "(Temp)";
# in most names these seem irrelevant. Let's omit those extras.
all_trips_v2 <- all_trips_v1 %>%
mutate(end_station_name = str_replace_all(end_station_name, "\\(\\*\\)|\\(Temp\\)", ""),
start_station_name = str_replace_all(start_station_name, "\\(\\*\\)|\\(Temp\\)", ""),
end_station_name = trimws(end_station_name),
start_station_name = trimws(start_station_name))
# While running the following code to check whether the unique values satisfy station_id = station_name,
## the returned tibble shows there are still some names that need to be renamed
## so that each unique station name corresponds to one unique station id.
all_trips_station3 <- all_trips_v2 %>%
group_by(start_station_id, start_station_name) %>%
summarise(station_count = n(), .groups = "drop") %>%
group_by(start_station_id) %>%
filter(n_distinct(start_station_name) > 1, start_station_id > 1) %>%
arrange(start_station_id)
all_trips_station4 <- all_trips_v2 %>%
group_by(end_station_id, end_station_name) %>%
summarise(station_count = n(), .groups = "drop") %>%
group_by(end_station_id) %>%
filter(n_distinct(end_station_name) > 1, end_station_id > 1) %>%
arrange(end_station_id)
# The number of station_ids with more than one station_name is reduced from 44 to 12 unique ids, and the total number of rows is now 25.
dim(all_trips_station3)
n_distinct(all_trips_station3$start_station_id)
head(all_trips_station3)
tail(all_trips_station3)
cat("\n\n")
head(all_trips_station4)
tail(all_trips_station4)
# To get rid of such inconsistencies between unique station ids and station names, let's rename those values in the station_name columns.
# Since there are two different unique station names for one unique station_id, we replace the low-frequency
# station name with the high-frequency one.
# Next we also remove leftover spaces in the names.
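The hand-written recode() below was built by eyeballing the frequency tables. The same "keep the high-frequency name per id" rule could also be derived programmatically; a sketch on toy data (the station names here mirror the real ones but the counts are invented):

```r
library(tidyverse)

stations <- tibble(
  start_station_id   = c(5, 5, 5, 9),
  start_station_name = c("Loomis St & Taylor St", "Loomis St & Taylor St",
                         "Throop St & Taylor St", "State St & 29th St")
)

# For each id, pick the most frequent name as the canonical one
canonical <- stations %>%
  count(start_station_id, start_station_name) %>%
  group_by(start_station_id) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(start_station_id, canonical_name = start_station_name)

# Overwrite every name with its id's canonical name
fixed <- stations %>%
  left_join(canonical, by = "start_station_id") %>%
  mutate(start_station_name = canonical_name) %>%
  select(-canonical_name)

n_distinct(fixed$start_station_name)
```

A programmatic rule like this scales better than a manual lookup, though the manual recode() keeps full control over edge cases.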
all_trips_station_cleaned <- all_trips_v2 %>%
mutate(
end_station_name = recode(end_station_name,
"Throop (Loomis) St & Taylor St" = "Loomis St & Taylor St",
"Throop St & Taylor St" = "Loomis St & Taylor St",
"Racine Ave (May St) & Fulton St" = "Elizabeth (May) St & Fulton St",
"Ashland Ave & 21st St" = "Laflin St & Cullerton St",
"Halsted St & 104th St" = "Western Ave & 104th St",
"Franklin St & Adams St" = "Franklin St & Quincy St",
"McClurg Ct & Illinois St" = "New St & Illinois St",
"Laflin St & Cullerton St" = "Ashland Ave & 21st St",
"Burling St (Halsted) & Diversey Pkwy"= "Burling St & Diversey Pkwy",
"Drake Ave & Fullerton Ave" = "St. Louis Ave & Fullerton Ave",
"Ashland Ave & 73rd St" = "Ashland Ave & 74th St",
"Jeffery Blvd & 91st St" = "Avenue O & 134th St",
"Michigan Ave & 114th St" = "Halsted St & 104th St"),
start_station_name = recode(start_station_name,
"Throop (Loomis) St & Taylor St" = "Loomis St & Taylor St",
"Throop St & Taylor St" = "Loomis St & Taylor St",
"Racine Ave (May St) & Fulton St" = "Elizabeth (May) St & Fulton St",
"Ashland Ave & 21st St" = "Laflin St & Cullerton St",
"Halsted St & 104th St" = "Western Ave & 104th St",
"Franklin St & Adams St" = "Franklin St & Quincy St",
"McClurg Ct & Illinois St" = "New St & Illinois St",
"Laflin St & Cullerton St" = "Ashland Ave & 21st St",
"Burling St (Halsted) & Diversey Pkwy"= "Burling St & Diversey Pkwy",
"Drake Ave & Fullerton Ave" = "St. Louis Ave & Fullerton Ave",
"Ashland Ave & 73rd St" = "Ashland Ave & 74th St",
"Jeffery Blvd & 91st St" = "Avenue O & 134th St",
"Michigan Ave & 114th St" = "Halsted St & 104th St",
"Dodge Ave & Main St" = "Chicago Ave & Dempster St"),
end_station_name = trimws(end_station_name),
start_station_name = trimws(start_station_name)
)
# A change has been made to the entire dataset, so we rename the main dataset to the next version.
all_trips_v3 <- all_trips_station_cleaned
# After cleaning the station names, the results below show that two names in start/end_station_name remain uncleaned.
# We apply the previous renaming approach to get rid of them too.
all_trips_station5 <- all_trips_v3 %>%
group_by(end_station_id, end_station_name) %>%
summarise(station_count = n(), .groups = "drop") %>%
group_by(end_station_id) %>%
filter(n_distinct(end_station_name) > 1, end_station_id > 1) %>%
arrange(end_station_id)
head(all_trips_station5)
tail(all_trips_station5)
all_trips_station6 <- all_trips_v3 %>%
group_by(start_station_id, start_station_name) %>%
summarise(station_count = n(), .groups = "drop") %>%
group_by(start_station_id) %>%
filter(n_distinct(start_station_name) > 1, start_station_id >1) %>%
arrange(start_station_id)
head(all_trips_station6)
tail(all_trips_station6)
# Here we rename the remaining values in start/end_station_name.
all_trips_station_cleaned2 <- all_trips_v3 %>%
mutate(
end_station_name = recode(end_station_name,
"Ashland Ave & 21st St" = "Laflin St & Cullerton St",
"Halsted St & 104th St" = "Western Ave & 104th St"
),
start_station_name = recode(start_station_name,
"Ashland Ave & 21st St" = "Laflin St & Cullerton St",
"Halsted St & 104th St" = "Western Ave & 104th St"
),
end_station_name = trimws(end_station_name),
start_station_name = trimws(start_station_name))
# We reassign the newly created dataset to the main dataset name to keep the main dataset names consistent,
# changing only the version number.
all_trips_v4 <- all_trips_station_cleaned2
# Now, we check and notice that the number of unique start/end_station_name values is approximately equal to the number of unique start/end_station_id values.
tibble(start_station_id = n_distinct(all_trips_v4$start_station_id),
start_station_name = n_distinct(all_trips_v4$start_station_name),
end_station_id = n_distinct(all_trips_v4$end_station_id),
end_station_name = n_distinct(all_trips_v4$end_station_name))
# Let's further inspect whether there are any NAs or empty strings "" in start/end_station_id and start/end_station_name, and then correct them.
all_trips_v4 %>% filter(is.na(start_station_name)|
is.na(start_station_id) |
is.na(end_station_name) |
is.na(end_station_id)) %>%
select(start_station_id, start_station_name, end_station_id, end_station_name)
all_trips_v4 %>% filter(start_station_name == "" |
start_station_id == "" |
end_station_name == "" |
end_station_id == "") %>%
select( start_station_id, start_station_name, end_station_id, end_station_name)
# From the empty tibbles we see there are no NAs or "".
# Let's look at the member_casual and rideable_type variable categories:
# In the member_casual column, there are 4 categories, which should be just "member" and "casual".
# The rideable_type column consists of many ride types indicated by codes; we don't break these categories down further in this analysis.
unique(all_trips_v4$member_casual)
head(unique(all_trips_v4$rideable_type), 20)
all_trips_v4 <- all_trips_v4 %>%
mutate(member_casual = ifelse(member_casual == "Subscriber", "member",
ifelse(member_casual == "Customer", "casual", member_casual)))
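An equivalent, arguably more readable form of the nested ifelse() recoding uses dplyr::case_when(); a small self-contained sketch (the vector here is illustrative):

```r
library(dplyr)

# Toy labels mixing the 2019 and 2020 naming conventions
x <- c("Subscriber", "Customer", "member", "casual")

recoded <- case_when(
  x == "Subscriber" ~ "member",
  x == "Customer"   ~ "casual",
  TRUE              ~ x   # leave already-clean labels untouched
)

unique(recoded)
```

case_when() reads top to bottom and stops at the first matching condition, which avoids the deep nesting that ifelse() chains accumulate.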
In this step, we perform manipulation and aggregation on the data to organize the parts needed for analysis, through particular tibbles and visualisations. This will make it easy to discover trends and relationships in both members' and casuals' number of rides and average ride time. We will also come to understand how these two customer types differ in their use of the service by year, month, week, and hour. The aim is to answer the business question.
* We add some relevant new variables/columns.
* We then format the new variables with the proper types.
# Let's manipulate the started_at column and generate new columns (date, year, month, day, day_of_week, minutes).
all_trips_v4$date <- as.Date(all_trips_v4$started_at)
all_trips_v4$year <- format(as.Date(all_trips_v4$started_at), "%Y")
all_trips_v4$month <- format(as.Date(all_trips_v4$started_at), "%m") # "%m" is month; "%M" would be minutes
all_trips_v4$day <- format(as.Date(all_trips_v4$started_at), "%d") # "%d" is day of month; "%D" would be mm/dd/yy
all_trips_v4$day_of_week <- format(as.Date(all_trips_v4$started_at), "%A")
all_trips_v4$minutes <- format(as.POSIXct(all_trips_v4$started_at), "%M")
# We also add a ride_duration/ride_length column: it shows how long a member or a casual spent
# on each ride, in seconds (units are set explicitly so the later division by 60 reliably yields minutes).
all_trips_v4$ride_length <- difftime(all_trips_v4$ended_at, all_trips_v4$started_at, units = "secs")
# We check the dimensions and column names of the dataset to see whether the mentioned columns were generated.
dim(all_trips_v4)
colnames(all_trips_v4)
cat("\n\n")
# Apply the str() function to observe any anomalies in the structure of the dataset.
# After looking through the columns and their structures, we notice that ride_length is not in the proper format and should be converted to the numeric (num) data type.
str(all_trips_v4)
# We notice that "ride_length" is not in the proper format, so we convert it to numeric.
all_trips_v4$ride_length <- as.numeric(as.character(all_trips_v4$ride_length))
# Let's confirm that the conversion was done correctly
summary(all_trips_v4$ride_length)
cat("\n\n")
str(all_trips_v4$ride_length)
# Though the data type conversion was done correctly, there are numerous negative values in the ride_length column.
# After investigating further, we found that those negative values (at station "HQ QR") relate to rides made for quality and maintenance checks.
all_trips_negValue0 <- all_trips_v4 %>%
select(start_station_name, ride_length) %>%
filter(ride_length < 0, start_station_name == "HQ QR")
all_trips_negValue1 <- all_trips_v4 %>%
filter(ride_length < 0, start_station_name == "HQ QR") %>%
group_by(start_station_name) %>%
summarise(count_negride_length = sum(ride_length <0, na.rm = TRUE),
total_ride_length = n())
dim(all_trips_negValue0)
head(all_trips_negValue0)
all_trips_negValue1
# Now, remove the irrelevant negative values.
all_trips_v5 <- all_trips_v4[
!(all_trips_v4$start_station_name == "HQ QR" | all_trips_v4$ride_length < 0),]
all_trips_negValue1 <- all_trips_v5 %>%
select(start_station_name, ride_length) %>%
filter(ride_length < 0)
# Let's check whether those values were omitted
all_trips_negValue1
head(all_trips_v5, 3)
By performing the analysis and visualisation we convey the message (insights) gained from the analysis
to the stakeholders.
* We form tibbles from the variables required for particular visualisations.
* We apply visualisation to those (summary) tibbles to illustrate the differences through trends
and relationships.
# Dataset aggregation on the basis of year: in the resulting tibble you see:
## the number of rides done each year
## statistics related to rides in each year
all_trips_aggregation <- all_trips_v5 %>%
group_by(member_casual, year) %>%
summarise(ride_count = n(),
mean_ride= round(mean(ride_length/60, na.rm = TRUE), 2),
median_ride = round(median(ride_length/60, na.rm = TRUE), 2),
min_ride = round(min(ride_length/60, na.rm = TRUE), 2),
max_ride = round(max(ride_length/60, na.rm = TRUE), 2), .groups = "drop") %>%
mutate(percentage = round(ride_count/ sum(ride_count)*100, 2))
all_trips_aggregation
# Counting the number of rides done by members and casuals across both years, and the percentage of each.
all_trips_year <- all_trips_v5 %>%
group_by(member_casual) %>%
summarise(ride_count = n(), .groups = "drop") %>%
mutate(percentage = round(ride_count/ sum(ride_count)*100, 2))
all_trips_year
# We load libraries for richer visualisation
library(ggplot2)
library(scales) # "scales" is loaded to customise the numbers on the y axis
library(gridExtra) # "gridExtra" is loaded to customise the plot layout
library(grid)
grid.newpage()
options(repr.plot.width = 12, repr.plot.height = 5)
# In the plots below:
## "geom_bar" is applied to visualise the number of rides members and casuals made each year
## "coord_polar" is applied to display a pie chart showing the number of rides as percentages
ride_countyear <- ggplot(data = all_trips_aggregation) +
geom_bar(mapping = aes(x = member_casual, y = ride_count, fill = year),
stat = "identity", position = "dodge") +
labs(
title = "Ride trips by year",
x = "member_casual",
y = "ride_count"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
ride_countyrptg1 <- ggplot(data = all_trips_year,
mapping = aes(x = "", y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity", width = 0.5) +
coord_polar("y")+
geom_text(aes(label = paste0(percentage, "%")),
position = position_stack(vjust = 0.5),
color = "yellow")+
labs(title = "Number of rides in percentage")+
theme_void()
grid.arrange( ride_countyear, ride_countyrptg1, ncol = 2)
The bar chart displays the number of rides, and the pie chart shows these numbers as percentages, for the years 2019 and 2020.
- The bar chart reflects increased ride trips in 2020 compared with 2019 for both rider types.
* What is more important is the shift in the number of rides by casuals, which roughly tripled compared with
the previous year and grew by about a third more than members' rides in 2020.
* This means members did not ride much more than casuals in 2020 (the opposite of the previous year), though
members' rides still nearly doubled compared with the previous year.
- The pie chart shows the percentage of rides by members and casuals over both years together.
* The number of rides done by members (69.77%) is considerably higher than by casuals (30.23%).
# The dataset is aggregated by month and year, displaying members' and casuals' riding-time behaviour and differences.
# In the returned tibble you can see:
## the mean ride time for each month, across the four seasons of the year.
## the number of rides / ride trips done each month.
month_meanrides <- all_trips_v5 %>% mutate(month = month(started_at, label = TRUE)) %>%
group_by(member_casual, year, month) %>%
summarise(mean_ride = mean(ride_length/60, na.rm = TRUE),
ride_count = n(), .groups = "drop")
dim(month_meanrides)
head(month_meanrides, 3)
tail(month_meanrides, 3)
options(repr.plot.width = 9, repr.plot.height = 5)
month_meanride <- ggplot(data = month_meanrides) +
geom_bar(mapping = aes(x = month, y = mean_ride, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid( member_casual ~ year ) +
labs(
title = "Average ride time spent each month by members and casuals",
x = "month",
y = "mean_ride"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
month_countride <- ggplot(data = month_meanrides) +
geom_bar(mapping = aes(x = month, y = ride_count, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid( member_casual ~ year ) +
labs(
title = "Number of rides done each month by members and casuals",
x = "month",
y = "ride_count"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
month_meanride
month_countride
These two plots are generated from the average ride time (mean_ride) and the number of rides (ride_count) for each month across 2019 and 2020, i.e. across the four seasons of the year.
Overall, the average ride length/time is nearly three times larger for casuals than for members.
- The first plot (in red, on top) shows stable average ride times for **casuals** in 2019,
* but with an extremely high average ride time in February, almost three times larger than in the rest of the months.
* These numbers changed the next year (2020), showing the highest average ride time at the beginning
of the year, in winter (January and February).
* Following this season, the number roughly halved in spring (March and April).
* It kept decreasing steadily until the end of the year, reaching its lowest point in December.
- By contrast, the average ride time **for members**, who spent less time on the bikes, was nearly constant in
2020. In particular, members' highest average ride length was in spring (April and May) and early summer
(June and July).
In general, the second plot illustrates a similar pattern of ride trips for both members and casuals, although there
are some differences in their ride counts over the year.
- The red plot illustrates ride trips (ride_count) for casuals in both years, with a relative increase in 2020.
* The period when casuals ride the most starts at the end of spring (May), reaches its highest ride numbers
at the end of summer (August), and declines until the end of fall (November).
* The lowest number of rides is in winter, and it is insignificant, meaning casuals are not willing to use the bikes then.
- By contrast, members made more rides than casuals in every month of the year.
* In 2019, ride numbers rose from mid-spring (April), peaked in August, and declined
until mid-fall (October).
* Despite a nearly identical pattern and relatively lower ride numbers in 2020, members experienced their
lowest ride trips in mid-spring (April) of 2020.
# The dataset is aggregated by weekday and year, displaying members' and casuals' riding patterns.
# In the resulting tibble you can see:
## the number of rides / ride trips done each day of the week in each year.
weekday_ride <- all_trips_v5 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, year, weekday) %>%
summarise(ride_count = n(), .groups = "drop")
head(weekday_ride, 3)
options(repr.plot.width = 12, repr.plot.height = 6)
# The dataset is aggregated by weekday and year, displaying members' and casuals' riding-time patterns.
# In the resulting tibble you can see:
## the mean ride time for each day of the week in each year.
## the number of rides / ride trips done each day.
weekday_meanride <- all_trips_v5 %>%
mutate(weekday = wday(started_at, label = TRUE))%>%
group_by(member_casual, year, weekday) %>%
summarise(mean_ride = round(mean(ride_length/60, na.rm = TRUE), 2),
ride_count = n(), .groups = "drop_last")
dim(weekday_meanride)
head(weekday_meanride)
# To better understand the riding patterns of members and casuals,
# it is suitable to consider mean_ride in the visualisation.
# In the visualisations below:
## the first plot displays the average (mean) ride time spent by members and casuals each day.
## the second plot displays the number of rides / ride trips / frequency of rides done each day.
options(repr.plot.width = 8, repr.plot.height = 4)
ggplot(data = weekday_meanride) +
geom_bar(mapping = aes(x = weekday, y = mean_ride, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid(member_casual ~ year)+
labs(
title = "Average ride time spent each day by members and casuals",
x = "weekdays",
y = "mean_ride"
) +
theme_minimal() +
scale_y_continuous(labels = label_number(accuracy = 0.01))
ggplot(data = weekday_meanride) +
geom_bar(mapping = aes(x = weekday, y = ride_count, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid(member_casual ~ year)+
labs(
title = "Number of rides made each day by members and casuals",
x = "Weekday",
y = "Number of rides"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
Both plots compare the mean ride time spent each day and the number of rides made by members and casuals.
- The first plot shows that casuals' mean ride time is almost four times longer than members'.
* In 2020 the mean ride time, which in 2019 was largest in the middle of the week, decreased and became largest on the weekends (Saturday and Sunday), with a minor reduction from the previous year.
- In contrast, the mean ride time remained stable for members in both years, with slightly larger mean rides on the weekends.
- In the second plot, the frequency of rides for casuals follows almost the same pattern in both years, with a relatively small increase in the second year.
* Accordingly, casuals ride most on the weekends and least on the other weekdays, which suggests they mostly use the bikes for leisure. This pattern is the complete opposite for members.
* Members mostly used the bikes on weekdays in both years rather than on weekends, presumably for jobs or tasks that needed to be done.
* Another point worth noting is that members' rides were concentrated in the middle of the week in 2019, then decreased in 2020 and became more evenly spread across the week.
* On top of everything, members made nearly three times as many ride trips as casuals.
# Aggregate by hour of day (00-23) to show each group's riding-time and bike-usage pattern.
# The resulting tibble contains:
## the mean ride time (in minutes) for each hour of the day.
## the number of rides (ride trips) made in each hour.
hours_meanride <- all_trips_v5 %>%
mutate( hour = format(as.POSIXct(started_at), "%H")) %>%
group_by(member_casual, year, hour) %>%
summarise(meanride = mean(ride_length/60, na.rm = TRUE),
ride_count = n(), .groups = "drop_last")
dim(hours_meanride)
head(hours_meanride, 3)
tail(hours_meanride, 3)
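The `hour` column above is extracted with `format(..., "%H")`, which yields a zero-padded character such as "07" rather than a number. A minimal standalone illustration (the timestamp is made up):

```r
# format() gives a zero-padded character hour, which sorts correctly on a
# discrete plot axis; as.integer() would convert it if numeric hours were needed.
started <- as.POSIXct("2020-06-15 07:45:00", tz = "UTC")
format(started, "%H")            # "07" (character)
as.integer(format(started, "%H"))  # 7
```

Keeping the hour as a character is a reasonable choice here, since ggplot2 then treats it as a discrete axis with one bar per hour.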
# Summary statistics for mean ride time (meanride), in minutes, by rider type.
min(hours_meanride$meanride[hours_meanride$member_casual == "member"])
max(hours_meanride$meanride[hours_meanride$member_casual == "member"])
mean(hours_meanride$meanride[hours_meanride$member_casual == "member"])
cat("\n")
min(hours_meanride$meanride[hours_meanride$member_casual == "casual"])
max(hours_meanride$meanride[hours_meanride$member_casual == "casual"])
mean(hours_meanride$meanride[hours_meanride$member_casual == "casual"])
# In the visualizations below:
## the first plot displays the average (mean) ride time spent by members and casuals each hour.
## the second plot displays the number of rides (ride trips) made each hour.
hour_mride <- ggplot(data = hours_meanride) +
geom_bar(mapping = aes(x = hour, y = meanride, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid(member_casual ~.) +
labs(
title = "Average ride time spent by members and casuals each hour",
x = "Hour of day",
y = "Mean ride time (min)"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
hour_cride <- ggplot(data = hours_meanride) +
geom_bar(mapping = aes(x = hour, y = ride_count, fill = member_casual),
stat = "identity", position = "dodge") +
facet_grid( member_casual ~.) +
labs(
title = "Number of rides made by members and casuals each hour",
x = "Hour of day",
y = "Number of rides"
) +
theme_minimal()+
scale_y_continuous(labels = label_number(accuracy = 0.01))
hour_mride
hour_cride
Both plots display the average ride time (mean_ride) and the number of rides (ride_count) made by members and casuals in each hour of the day.
The first plot visualizes the average ride time for both groups:
- Casual riders spend the most time per ride overnight, from 12am to 5am, with the longest average rides at 3am and then 4am; their average ride time is stable for the remaining hours.
- In contrast, members' average ride time is almost stable across all 24 hours, with a moderate increase overnight from 2am to 3am.
The second plot shows how many times members and casuals used the bikes in each hour of the day:
- The top bar chart counts the ride trips made by casuals in each hour.
* In particular, usage starts increasing from 6am, reaches its peak at 5pm, and then declines until 12am.
* This means most of the bike usage falls in the period from 12pm to 7pm.
- On the other hand, the number of trips made by members fluctuates more and is greater than casuals' in general.
* Their trip counts peak twice: first at 8am and again at 7pm.
* Their busiest riding period is from 6am to 10pm.
* Additionally, ridership is nearly stable within this period except for the peaks at 7am-8am and 4pm-6pm.
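The peak hours described above can also be read off programmatically rather than by eye. A small standalone sketch with toy hourly counts (the numbers are invented, not the real ride counts):

```r
# Toy hourly ride counts for hours "00".."09"; which.max() returns the
# position of the maximum, and names() recovers the hour label.
counts <- setNames(c(5, 3, 2, 2, 4, 8, 20, 45, 60, 30), sprintf("%02d", 0:9))
names(which.max(counts))  # "08"
```

Applied per group to `hours_meanride$ride_count`, this would confirm the 8am and 7pm peaks without reading them off the chart.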
# Generate two summary tibbles based on "member_casual" and "routes",
# with a count variable (count_stations), to find the routes most used by members and casuals.
stations_route1 <- all_trips_v5 %>%
filter(member_casual == "casual") %>%
mutate(routes = str_c(start_station_name, "->", end_station_name)) %>%
group_by(member_casual, routes) %>%
summarise(count_stations = n(), .groups = "drop") %>%
arrange(desc(count_stations)) %>%
slice_head(n = 20)
head(stations_route1)
stations_route2 <- all_trips_v5 %>%
filter(member_casual == "member") %>%
mutate(routes = str_c(start_station_name, "->", end_station_name)) %>%
group_by(member_casual, routes) %>%
summarise(count_stations = n(), .groups = "drop") %>%
arrange(desc(count_stations)) %>%
slice_head(n = 20)
dim(stations_route2)
head(stations_route2)
tail(stations_route2)
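One thing to watch with `str_c()`: if either station name is NA, the whole route becomes NA, so rides with a missing start or end station silently fall into a single NA route. A small standalone sketch (stringr only, toy vectors):

```r
library(stringr)

# str_c() propagates NA (unlike paste(), which would print "NA" as text).
start <- c("Clark St", NA,       "State St")
end   <- c("Lake St",  "Oak St", NA)
str_c(start, "->", end)
# "Clark St->Lake St" NA NA
```

Filtering on `!is.na(start_station_name) & !is.na(end_station_name)` before building routes would keep the counts honest if the cleaned data still contains missing stations.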
options(repr.plot.width = 10, repr.plot.height = 5)
# Creating a visualisation that display most used routes
ggplot(data = stations_route1, mapping = aes( x = reorder(routes, count_stations), y = count_stations,
fill = count_stations)) +
geom_bar(stat = "identity", position = "dodge", show.legend = TRUE)+
coord_flip()+
labs(
title = "Top 20 Most Frequent Routes for casual riders",
x = "Route",
y = "Number of Rides"
) +
theme_minimal()
cat("
Description:
Count_station_routes (casuals)
- This plot displays the top 20 routes used by casual riders, in order of frequency.
- Most of these routes appear to go to recreational locations.
\n")
ggplot(data = stations_route2, mapping = aes( x = reorder(routes, count_stations), y = count_stations,
fill = count_stations)) +
geom_bar(stat = "identity", position = "dodge", show.legend = TRUE)+
coord_flip() +
labs(
title = "Top 20 Most Frequent Routes for member riders",
x = "Route",
y = "Number of Rides"
) +
theme_minimal()
cat("
Description:
Count_station_routes (members)
- This plot displays the top 20 routes used by members, in order of frequency.
\n")
# Top 10 routes per rider type (20 rows in total) for a combined comparison plot.
stations_route_top20 <- all_trips_v5 %>%
mutate(routes = str_c(start_station_name, "->", end_station_name)) %>%
group_by(member_casual, year, routes) %>%
summarise(count_stations = n(), .groups = "drop") %>%
group_by(member_casual) %>%
slice_max(count_stations, n = 10) %>%
ungroup() %>%
arrange (desc(count_stations))
head(stations_route_top20)
options(repr.plot.width = 12, repr.plot.height = 5)
stations_routes_top20 <- ggplot(data = stations_route_top20, mapping = aes( x = reorder(routes, count_stations),
y = count_stations, fill = member_casual )) +
geom_bar(stat = "identity", position = "dodge", show.legend = TRUE)+
coord_flip() +
labs(
title = "Top 10 Most Frequent Routes per Rider Type (members vs. casuals)",
x = "Route",
y = "Number of Rides"
) +
theme_minimal()
stations_routes_top20
The objective of this case study is to answer the business question: how do members and casual riders use the bikes differently? Based on the trends and behaviors they showed over particular periods of time (year, month, weekday, and hour) and over particular routes, stakeholders will be able to develop an effective marketing strategy to convert casuals into members.
The bar chart displays the number of rides, and the pie chart shows these numbers as percentages, for the years 2019 and 2020.
- The bar chart reflects an increase in ride trips from 2019 to 2020 for both rider types.
* Most importantly, the number of rides made by casuals roughly tripled year over year, ending about a third higher than members' count in 2020.
* This means members no longer out-rode casuals in 2020 (the opposite of the previous year), although members' ride count still nearly doubled compared with the previous year.
- The pie chart shows the percentage of rides by rider type across both years combined.
* The share of rides made by members (69.77%) is considerably higher than that of casuals (30.23%).
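The member/casual shares quoted above can be recomputed directly from the ride counts. A minimal sketch with hypothetical totals standing in for the real counts from all_trips_v5:

```r
# Hypothetical ride counts chosen to reproduce the quoted shares;
# the real numbers come from counting rows of all_trips_v5 per rider type.
rides <- c(member = 6977, casual = 3023)
round(100 * rides / sum(rides), 2)
# member: 69.77, casual: 30.23
```

The same one-liner applied to `table(all_trips_v5$member_casual)` would verify the pie-chart percentages.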
Overall, the average ride time is nearly three times longer for casuals than for members.
Casuals, specifically, had a stable average ride length in 2019 (apart from an average about three times larger in February), which changed in 2020:
- Peak points: 89000m (January) and 7600m (February), in mid-winter
- Lowest points: 1250m (October) and 1200m (December), during fall and the beginning of winter
The average ride time for members was nearly constant in both years, but in 2020:
- Peak points: 1500m (April) to 1000m (May), in spring
- Lowest points: 400m (January) and 450m (December), in winter
Both casuals and members show an almost similar pattern in ride trips, with small differences:
- Most trips: May (90000t), end of spring, to November (70000t), end of fall.
- Peak point: August (2800000t)
- Lowest points: December (30000t) to February (20000t), in winter
By contrast, in 2020 members made more rides than casuals in every month of the year.
- Most trips: June (190000t, down from 350000t the previous year) to October (210000t, down from 400000t the previous year), mid-spring to mid-fall.
- Peak point: August (320000t, down from 400000t the previous year).
- Lowest points: April (50000t) and December (80000t) to February (20000t), in winter
Casuals spent almost four times longer on average than members across the weekdays in general.
Average ride time for casuals during the week:
- Peak points: 56m (Sunday) and 50m (Saturday), on weekends
- Lowest points: 42m (Tuesday and Wednesday), in the middle of the week
In contrast, the mean ride time (around 18m) remained stable for members in both years, with slightly larger mean rides on the weekends.
The frequency of rides for casuals followed almost the same pattern in both years, with a relatively small increase in the second year.
- Most trips: Saturday (300000t), Sunday (250000t), and Friday (195000t)
- This suggests casuals use the bikes more for leisure on the weekends.
In contrast, members mostly rode on weekdays: around 300000t in 2020, considerably reduced from around 460000t in 2019. This suggests they use the service mostly for necessary tasks, not leisure.
As a whole, casuals spent about three times longer per ride on average than members.
Casuals:
- High average points: 12am (100m) to 5am (100m), overnight
- Peak points: 3am and 4am (190m and 140m)
- Lowest points: 7am and 8am (30m and 50m), in the morning
Members:
- Members' average ride time is almost stable across all 24 hours (around 20m)
- but with a moderate increase overnight from 2am to 3am (around 25m)
Casuals:
- Most trips: starting at 7am and decreasing until 11pm
- Highest trips: 11am to 8pm (around 100000t)
- Peak point: 7pm (150000t)
- Lowest trips: 1am to 6am (15000t)
Members:
- Most trips: 6am to 9pm (around 140000t)
- Highest trips: first at 7am-8am (250000t - 290000t),
second at 4pm, 5pm, and 6pm (300000t - 390000t - 250000t)
- Lowest trips: 12am to 4am (15000t)
Casuals: the top 20 routes, ordered from most used. Most of these routes appear to go to recreational locations.
Based on these conclusions, it is suggested that the executives and the marketing department focus on the areas and stations where casuals ride most while using the service. They can also base their marketing campaign on the times of day casuals spend riding the bikes. The frequency of trips casuals make can also be influential if taken into account in the marketing strategy. Both longer ride times and high trip frequency can potentially increase the chances of converting casuals into members. All of this can be read from the specific time periods shown in the visualizations.