This case study aims to analyze the usage patterns of the public bike-sharing system in New York City, specifically the use of Citi Bike, throughout the year. Various factors that may influence user behavior will be explored, such as seasonality, weather conditions (like rain), and user type (frequent members vs. occasional users).
The analyses focus on the following aspects:
Seasonality: Variations in bike usage will be examined according to the seasons of the year (spring, summer, fall, and winter), evaluating factors such as trip duration and the number of users per season.
Weather Conditions: The impact of rain on bike usage will be analyzed. Usage patterns on rainy days will be compared to those on dry days, observing whether weather conditions affect the number of trips taken and their average duration.
User Type: Differences in user behavior will be studied according to user type (occasional vs. member), with the goal of identifying variations in usage frequency, trip duration, or peak activity hours between these groups.
Through these analyses, the aim is to identify relevant patterns that can help better understand user behavior and its relationship with seasonal and weather conditions, as well as usage differences based on demographic and special characteristics.
1.1 Dataset Used
For this study, the public dataset provided by Citi Bike, New York City’s bike-sharing system, has been used. The data is available on Citi Bike’s official data portal and contains detailed information about each trip made using the service.
1.2 Data Source
The downloaded files correspond to trips made during the year starting in 2024. Each month is represented by a separate CSV file, with approximately 1 million records per month. The files were downloaded, stored in a local folder named data/, and then read and combined into a single data frame for analysis.
1.3 Variables Used
The original dataset contains the following variables:
ride_id: unique identifier of the ride.
rideable_type: type of bike used (classic, electric, or dockless).
started_at: date and time when the ride started.
ended_at: date and time when the ride ended.
start_station_name: name of the starting station.
start_station_id: identifier of the starting station.
end_station_name: name of the destination station.
end_station_id: identifier of the destination station.
start_lat, start_lng: geographic coordinates of the starting location.
end_lat, end_lng: geographic coordinates of the destination.
member_casual: user type (member for subscribed users or casual for occasional users).
Note: The data does not include variables directly related to the user’s gender or age, so the analysis is limited to the user type (member_casual) as the only sociodemographic variable.
1.4 Normalization and Cleaning Process
Since the monthly files were not entirely homogeneous (for example, some columns like start_station_id appeared as type double in some files and character in others), a preliminary type normalization was performed when reading the data. Station identifiers (start_station_id and end_station_id) were forced to character type to avoid type conflicts.
Additionally, new derived variables were created to facilitate temporal analysis:
duration: ride duration in minutes, calculated as the difference between ended_at and started_at.
month: month of the year extracted from the start date.
day_of_week: day of the week on which the ride took place.
season: season of the year, classified according to the month (Winter, Spring, Summer, Fall).
Records with negative durations or durations over 24 hours were also removed, as they were considered measurement errors or outliers.
2 Final Dataset Size
After preprocessing and merging the files, the resulting dataset contains approximately 44 million records, enabling a statistically representative and robust analysis of Citi Bike usage throughout the year 2024.
2.1 Column Compatibility Check
Before merging the CSV files, it is important to ensure that all files have the same columns. This guarantees that the data is combined correctly without errors related to incompatible columns.
2.1.1 Listing the Columns of Each File
The process began by listing the headers of each CSV file to identify the columns present in each one. This was done using the read_csv() function with the argument n_max = 0, which allows reading only the headers without loading the full data.
Show code
options(repos =c(CRAN ="https://cloud.r-project.org"))install.packages("purrr")install.packages("readr")library(purrr)library(readr)# List CSV files in the data folderfiles <-list.files(path ="data/", pattern ="*.csv", full.names =TRUE)# Read the headers of each file, suppressing the column type messagecolumn_lists <-map(files, ~names(read_csv(.x, n_max =0, show_col_types =FALSE)))# View the columns of each filenames(column_lists) <-basename(files)column_lists
2.1.2 Merging Columns
The column merging process aims to unify multiple CSV files into a single, clean, and consistent dataset. First, the files are loaded from the “data” folder and duplicate columns are removed. Then, the common columns across all files are identified to ensure they all share the same structure. If any columns are missing in some files, they are added with NA values and reordered to match the common columns. Additionally, station identifier columns are standardized to have consistent data types. Finally, the standardized files are combined into a single dataset using bind_rows(), ensuring that rows from each file are appended consistently.
Show code
# Load necessary librarieslibrary(dplyr)library(readr)library(tibble)library(purrr)# Define the set of CSV files to readfiles <-list.files(path ="data/", pattern ="*.csv", full.names =TRUE)# Check if the files existif(length(files) ==0) {stop("No CSV files found in the specified path.")}# Step 1: Read the files and store the columnscleaned_files <-map(files, function(file) { data <-read_csv(file, show_col_types =FALSE) # Suppress column type message# Remove duplicate columns, if any (e.g., 'rideable_type_duplicate_column_name_1') data <- data %>%select(-contains("duplicate_column_name"))return(data)})# Check the columns of the cleaned filescleaned_column_lists <-map(cleaned_files, names)# Step 2: Identify common columnsif (length(cleaned_column_lists) >0) { common_cols_cleaned <-reduce(cleaned_column_lists, intersect)} else {stop("No common columns found among the files.")}# Show common columnsprint("Common columns across all files:")print(common_cols_cleaned)# Step 3: Ensure all files have the same common columns and consistent data typesstandardized_files <-map(cleaned_files, function(data) { missing_cols <-setdiff(common_cols_cleaned, names(data))# Add missing columns with NA values data <- data %>%add_column(!!!setNames(rep(list(NA), length(missing_cols)), missing_cols), .after =0)# Reorder the columns according to the common columns data <- data %>%select(all_of(common_cols_cleaned))# Convert columns to consistent types# Convert 'start_station_id' and 'end_station_id' to character in all files data <- data %>%mutate(across(starts_with("start_station_id"), as.character),across(starts_with("end_station_id"), as.character) )return(data)})# Step 4: Combine the filescitibike_data_cleaned <-bind_rows(standardized_files)# Show the first rows of the combined datasethead(citibike_data_cleaned)# Save the cleaned combined file to a CSV# write_csv(citibike_data_cleaned, "citibike_combined_cleaned.csv")# Summary of the cleaning and merging process# summary(citibike_data_cleaned)
2.1.3 Block-wise Processing and Data Transformation
In this step, the procesar_en_bloques function is defined to efficiently handle large volumes of data. Since the dataset can be very large, it is divided into blocks of a specified size (by default, 1 million rows). The function iterates through each data block, applying several transformations: it converts the start and end times of each ride to POSIX format, calculates the ride duration in minutes, extracts the month and day of the week, and assigns a season (spring, summer, fall, winter) based on the start month. The results of each block are stored and then combined using bind_rows(). Finally, memory is freed using gc() to optimize the performance of the process.
Show code
library(tidyverse)library(lubridate)library(fasttime)rm(cleaned_files)# Function to process large datasets in blocksblock_processing <-function(data, block_size =1e6) { n <-nrow(data) blocks <-ceiling(n / block_size) result <-vector("list", blocks)for (i inseq_len(blocks)) {#cat("Processing block", i, "of", blocks, "...\n") start_time <-Sys.time() start <- ((i -1) * block_size) +1 end <-min(i * block_size, n) block <- data[start:end, ] %>%mutate(started_at =fastPOSIXct(started_at),ended_at =fastPOSIXct(ended_at),duration =as.numeric(difftime(ended_at, started_at, units ="mins")),month =month(started_at, label =TRUE),day_of_week =wday(started_at, label =TRUE),season =case_when(month(started_at) %in%c(12, 1, 2) ~"Winter",month(started_at) %in%c(3, 4, 5) ~"Spring",month(started_at) %in%c(6, 7, 8) ~"Summer",month(started_at) %in%c(9, 10, 11) ~"Fall" ) ) result[[i]] <- block end_time <-Sys.time() duration <-round(difftime(end_time, start_time, units ="secs"), 2)#cat("Block", i, "processed in", duration, "seconds.\n\n") }bind_rows(result)}# Apply the function to the cleaned Citi Bike datasetcitibike_data_cleaned <-block_processing(citibike_data_cleaned, block_size =1e5)# Free memorygc()
2.2 Data Load Verification
In this section, we verify the number of rows in each CSV file to ensure that all files have been loaded correctly. The map_int() function is used to count the rows of each file, and the result is printed for review.
Show code
# Verify the number of rows in each CSV file read, suppressing the column specification messagetibble(file = files) %>%mutate(rows =map_int(file, ~nrow(read_csv(.x, show_col_types =FALSE)))) %>%print(n =Inf)
3 Exploratory Data Analysis
3.1 Data Summary
This section provides an overview of the cleaned dataset using the glimpse() function, which displays the structure and the first few rows of the data. Additionally, descriptive statistics for the duration variable are calculated to gain a basic understanding of the distribution of bike trip durations in the system.
Show code
# Data Overviewglimpse(citibike_data_cleaned)# Descriptive Statistics for Trip Durationsummary(citibike_data_cleaned$duration)
3.2 Number of trips by season of the year
In this section, the number of trips made in the bicycle system by season of the year is analyzed. The ‘season’ variable is transformed into a factor with the correct order of the seasons (Spring, Summer, Autumn, Winter) and then the number of trips for each season is counted using count(). The results are visualized in a bar chart (geom_col()), with the X-axis representing the seasons and the Y-axis the number of trips.
The bar chart clearly shows the distribution of bicycle trips across the four seasons. It is evident that Summer has the highest number of bicycle trips, followed by Autumn and then Spring. Winter has the lowest number of trips by a significant margin compared to the other seasons. This pattern is expected, as warmer weather and longer daylight hours in Summer and Autumn are generally more conducive to cycling than the colder, shorter days of Winter and early Spring.
Show code
citibike_data_cleaned %>%mutate(season =factor(season, levels =c("Spring", "Summer", "Fall", "Winter"))) %>%count(season) %>%ggplot(aes(x = season, y = n, fill = season)) +geom_col() +labs(title ="Number of Trips by Season", x ="Season", y ="Number of Trips") +theme_minimal()
3.3 Distribution of trips by user type
In this section, the distribution of trips by user type is analyzed, differentiating between registered members and occasional users (member_casual). The number of trips made by each user type is counted using count(), and the results are visualized using a bar chart (geom_col()), with the X-axis representing the user types and the Y-axis the number of trips made. This allows observing how trips are distributed between frequent and sporadic users.
The bar chart illustrates the distribution of trips between casual and member users. It shows that the number of trips made by member users is significantly higher than the number of trips made by casual users. This indicates that registered members constitute the majority of users utilizing the bicycle system for trips.
Show code
citibike_data_cleaned %>%count(member_casual) %>%ggplot(aes(x = member_casual, y = n, fill = member_casual)) +geom_col() +labs(title ="Distribution of Trips by User Type", x ="User Type", y ="Number of Trips") +theme_minimal()
3.4 Average trip duration by user type
In this analysis, the average duration of trips is evaluated based on the user type, differentiating between registered members and occasional users (member_casual). The average trip duration is calculated grouped by user type and then visualized using a bar chart. The chart allows observing the differences in trip duration between frequent and sporadic users, which can be useful for understanding the usage patterns of the bicycle system.
The accompanying bar chart visually represents the average trip duration for casual versus member users. It clearly shows that casual users have a significantly longer average trip duration compared to member users. The bar for casual users is considerably taller, indicating their trips tend to last for more minutes on average than those taken by registered members. This suggests that casual users might use the bikes for longer recreational rides or one-off journeys, while members may use them more frequently for shorter commutes or regular trips.
Show code
citibike_data_cleaned %>%group_by(member_casual) %>%summarise(avg_duration =mean(duration, na.rm =TRUE)) %>%ggplot(aes(x = member_casual, y = avg_duration, fill = member_casual)) +geom_col() +labs(title ="Average Trip Duration by User Type", x ="User Type", y ="Average Duration (minutes)") +theme_minimal()
3.5 Analysis of bike type
In this analysis, the number of trips is analyzed based on the type of bicycle. The code groups the data by bike type, counts the number of trips for each type, and then creates a bar chart to visualize these counts.
Tbe chart shows the distribution of trips between different bicycle types. From the chart, it is evident that electric bikes are used for a significantly higher number of trips compared to classic bikes. The bar representing electric bikes is substantially taller than the bar for classic bikes, indicating a strong preference for, or greater availability/usage of, electric bicycles within the system.
Show code
library(dplyr)library(ggplot2)# Group by bike type and countbike_type_count <- citibike_data_cleaned %>%filter(!is.na(rideable_type)) %>%count(rideable_type, sort =TRUE)ggplot(bike_type_count, aes(x =reorder(rideable_type, n), y = n, fill = rideable_type)) +geom_col() +labs(title ="Analysis of bike type", x ="Bike type", y ="Number of trips") +theme_minimal() +scale_fill_viridis_d()
4 Detailed Temporal Analysis
4.1 Monthly Trend of Trips
The line graph illustrates a clear seasonal pattern in the number of Citi Bike trips for both casual and member users over the observed year. Both user types show a significant increase in ridership starting from the low point in January, peaking during the warmer months (approximately July through October), and then declining sharply towards the following January. Member users consistently account for a substantially higher number of trips each month compared to casual users, reinforcing their role as the primary users of the system. The peak in member trips is considerably higher than the peak for casual users, and the magnitude of monthly fluctuation appears more pronounced for members. This suggests that while both groups are influenced by seasonal factors, registered members utilize the service much more extensively throughout the year, particularly during favorable weather conditions.
Show code
library(dplyr)library(ggplot2)library(lubridate)# Group by month and user typemonthly_trips <- citibike_data_cleaned %>%mutate(month =floor_date(started_at, "month")) %>%group_by(month, member_casual) %>%summarise(total_trips =n(), .groups ="drop")# Plotggplot(monthly_trips, aes(x = month, y = total_trips, color = member_casual)) +geom_line(size =1.2) +labs(title ="Monthly trend of trips in Citi Bike",x ="Month",y ="Number of trips",color ="User type" ) +theme_minimal()
4.2 Trips by Day of the Week
The bar chart illustrating trips by day of the week reveals distinct usage patterns between member and casual users. Member users consistently show a higher volume of trips throughout the week, with peak ridership occurring on weekdays from Monday to Friday. Their trip numbers see a noticeable decrease during the weekend (Saturday and Sunday). This pattern strongly suggests that members primarily utilize the bike-sharing service for regular weekday activities, likely commuting or running errands.
In contrast, casual users exhibit the opposite trend, with lower trip numbers during the weekdays and a significant surge in ridership over the weekend, peaking on Saturday. This pattern indicates that casual users are more inclined to use the bikes for leisure activities, recreation, or tourism, which are typically pursued during days off. The clear divergence in weekday vs. weekend usage between the two user types highlights their different motivations and use cases for the bike-sharing system.
Show code
# Extract weekdaycitibike_data_cleaned <- citibike_data_cleaned %>%mutate(weekday =wday(started_at, label =TRUE, week_start =1))# Groupweekday_trips <- citibike_data_cleaned %>%group_by(weekday, member_casual) %>%summarise(total_trips =n(), .groups ="drop")# Plotggplot(weekday_trips, aes(x = weekday, y = total_trips, fill = member_casual)) +geom_col(position ="dodge") +labs(title ="Trips by Day of the Week",x ="Day",y ="Number of trips",fill ="User type" ) +theme_minimal()
4.3 Hourly Distribution of Trips
The bar chart displaying the hourly distribution of trips reveals distinct daily usage patterns between member and casual users. Member users exhibit a clear bimodal distribution, with significant peaks in ridership during typical morning commute hours (around 8-9 AM) and evening commute hours (around 5-6 PM). This pattern is characteristic of individuals using the bike service for regular daily travel, such as commuting to and from work or school.
In contrast, casual users show a more spread-out distribution throughout the day, with their ridership gradually increasing from the morning, peaking over a broader period in the afternoon and early evening (roughly 12 PM to 6 PM), and then declining. They do not display the sharp, distinct commute-hour peaks seen among members. This suggests that casual users utilize the bikes for a wider variety of activities throughout the day, likely including leisure rides, sightseeing, or running errands, rather than primarily for fixed-time commuting. Both user types have minimal activity during the late night and early morning hours. The difference in hourly patterns underscores the primary use cases for each user segment - routine transportation for members and more flexible, activity-based use for casual riders.
Show code
# Extract start hourcitibike_data_cleaned <- citibike_data_cleaned %>%mutate(hour =hour(started_at))# Plotggplot(citibike_data_cleaned, aes(x = hour, fill = member_casual)) +geom_histogram(binwidth =1, position ="dodge", color ="black") +scale_x_continuous(breaks =0:23) +labs(title ="Hourly distribution of trips",x ="Hour of the day",y ="Number of trips",fill ="User type" ) +theme_minimal()
4.4 Average duration by season of the year
The bar chart displays the average duration of bicycle trips across the four seasons. It clearly shows that the average trip duration is highest during the Summer, followed closely by Spring and then Fall. The average trip duration is significantly shorter in Winter compared to the other three seasons. This pattern aligns intuitively with weather conditions, as users are likely to take longer, more leisurely rides during the warmer and more pleasant months of Summer, Spring, and Fall. The considerably shorter average duration in Winter suggests that bike usage during this season may be limited to more essential or shorter trips, likely due to colder temperatures, adverse weather conditions, and shorter daylight hours making longer rides less appealing or practical.
Show code
library(dplyr)library(ggplot2)citibike_data_cleaned %>%group_by(season) %>%summarise(avg_duration =mean(duration, na.rm =TRUE)) %>%mutate(season =factor(season, levels =c("Spring", "Summer", "Fall", "Winter"))) %>%ggplot(aes(x = season, y = avg_duration, fill = season)) +geom_col() +labs(title ="Average duration by season of the year", y ="Duration (minutes)")
4.5 Average trip duration by bike type
The chart displays the average trip duration for classic and electric bicycles. It indicates that the average trip duration is slightly longer for classic bikes compared to electric bikes. While the difference appears relatively small, the visual representation suggests that users tend to spend a few more minutes on average during trips taken with classic bicycles. This could imply subtle differences in how these bike types are utilized, perhaps with electric bikes being favored for quicker trips or specific types of journeys where the electric assist helps reduce travel time.
Show code
library(dplyr)library(ggplot2)# Calculate average duration by bike typebike_type_duration <- citibike_data_cleaned %>%group_by(rideable_type) %>%summarise(avg_duration =mean(duration, na.rm =TRUE))# Create the plotggplot(bike_type_duration, aes(x = rideable_type, y = avg_duration, fill = rideable_type)) +geom_col() +labs(title ="Average duration by bike type", x ="Bike type", y ="Average duration (minutes)") +theme_minimal()
4.6 Comparative Analysis of Citi Bike Usage on Holidays vs. Regular Days in 2023
We know that users ride less during the cold seasons. Now we are going to evaluate if the users ride less when it is rainy or snowy, regardless of the temperature or season. To do so, we classify each day in the dataset according to the weather conditions — whether it had rain, snow, both, or was clear — using data from the weather API from visualcrossing. Then, we compare:
the percentage of total days falling into each weather category, and
the percentage of total rides that took place on those days.
By plotting both metrics side by side, we can observe whether users ride more or less than expected under each weather condition. For example, if 10% of the days were rainy but only 4% of rides occurred on those days, this suggests a negative impact of rain on usage. The chart also includes a label indicating whether the ride usage is higher (↑ Over-used), lower (↓ Under-used), or as expected, relative to the frequency of those weather days.
The chart clearly shows that Clear weather days account for the largest proportion of total days and, notably, an even larger proportion of total rides. This indicates that users are much more likely to ride on clear days, resulting in this category being significantly ↑ Over-used relative to its frequency.
Conversely, days with precipitation, whether Rain + Snow, Rain only, or Snow only, show a consistent pattern. While these conditions make up a certain percentage of the total days, the percentage of total rides occurring on these days is considerably lower than the percentage of days they represent. Specifically, days with Rain only occur more frequently than days with Rain + Snow or Snow only, but rides are still significantly ↓ Under-used compared to the frequency of rainy days. Both Rain + Snow and Snow only days are infrequent, and ride usage on these days is proportionally even lower, also categorized as ↓ Under-used.
In summary, the analysis strongly suggests that rainy and snowy weather conditions have a negative impact on Citi Bike usage, with users riding significantly less than would be expected based on the frequency of these weather types. Clear weather, on the other hand, is associated with a disproportionately higher volume of rides.
Show code
library(tidyverse)library(jsonlite)library(lubridate)# Load and process weather dataweather_json <-fromJSON("weather.json", flatten =TRUE)weather_days <- weather_json$daysweather_df <-as_tibble(weather_days) %>%mutate(date =as.Date(datetime),has_rain =map_lgl(preciptype, ~"rain"%in% .x),has_snow =map_lgl(preciptype, ~"snow"%in% .x),condition =case_when( has_rain & has_snow ~"Rain + Snow", has_rain ~"Rain only", has_snow ~"Snow only",TRUE~"Clear" ) ) %>%select(date, condition)# Add weather info to ride datarides <- citibike_data_cleaned %>%mutate(date =as.Date(started_at)) %>%left_join(weather_df, by ="date") %>%mutate(condition =replace_na(condition, "Unknown"))# --- 1. % of Days ---days_dist <- weather_df %>%count(condition, name ="n_days") %>%mutate(percent_days =round(100* n_days /sum(n_days), 1))# --- 2. % of Rides ---rides_dist <- rides %>%count(condition, name ="n_rides") %>%mutate(percent_rides =round(100* n_rides /sum(n_rides), 1))# --- 3. Join and compute Ride-to-Day ratio ---comparison <-full_join(days_dist, rides_dist, by ="condition") %>%mutate(ride_to_day_ratio = percent_rides / percent_days,usage_weight =case_when( ride_to_day_ratio >1~"↑ Over-used", ride_to_day_ratio <1~"↓ Under-used",TRUE~"Expected" ) )# View resultsprint(comparison)# --- 4. Plot ---ggplot(comparison, aes(x = condition)) +geom_col(aes(y = percent_days, fill ="Days"), position ="dodge", width =0.6) +geom_col(aes(y = percent_rides, fill ="Rides"), position ="dodge", width =0.6) +geom_text(aes(y =pmax(percent_days, percent_rides) +1,label = usage_weight), size =3.5, fontface ="bold") +labs(title ="Citi Bike Usage vs Weather Conditions (2024)",subtitle ="Comparison between proportion of weather days and ride distribution",x ="Weather Condition", y ="Percentage",fill ="" ) +scale_fill_manual(values =c("Days"="skyblue", "Rides"="orange")) +theme_minimal()
5 Spatial Analysis of Trips
5.1 Trip Density by Station
The spatial plot visualizing trip density by start station reveals significant geographical variations in the usage of the bike-sharing system. There are clear “hot spots” or areas with a high concentration of both stations and trip origins, indicated by the clustering of points and the larger, brightly colored markers (representing a high log-transformed number of trips). These areas likely correspond to major points of interest such as central business districts, dense residential neighborhoods, transportation hubs, or popular tourist destinations, where demand for bike rentals is highest.
Conversely, areas with fewer or smaller/darker points represent stations with lower trip volumes. The use of a log transformation on the color scale is particularly effective in this plot as it allows for better differentiation among the numerous stations with moderate to lower trip counts, making subtle variations in activity levels visible across the entire service area, not just highlighting the few busiest stations. The plot underscores that the popularity and usage of bike stations are highly dependent on their location and surrounding urban characteristics.
Show code
library(dplyr)library(ggplot2)library(viridis)station_density <- citibike_data_cleaned %>%group_by(start_station_name, start_lat, start_lng) %>%summarise(total_trips =n(), .groups ="drop")ggplot(station_density, aes(x = start_lng, y = start_lat)) +geom_point(aes(size = total_trips, color =log10(total_trips +1)), alpha =0.7) +scale_color_viridis() +scale_size_continuous(range =c(1, 10)) +labs(title ="Trip density by start station (Color - Log Transformed)",x ="Longitude", y ="Latitude",size ="Total trips", color ="Total trips (log10)") +theme_minimal()
Show code
ggplot(station_density, aes(x = start_lng, y = start_lat)) +geom_point(aes(size = total_trips, color = total_trips), alpha =0.7) +scale_color_viridis(option ="plasma") +scale_size_continuous(range =c(1, 10)) +labs(title ="Trip density by start station (Plasma Palette)",x ="Longitude", y ="Latitude", size ="Total trips", color ="Total trips") +theme_minimal()
5.2 Analysis of Most Frequent Trips: Start and End Stations
The horizontal bar chart displaying the “Top 10 start stations” highlights the busiest locations where bike trips originate within the system. “W 21 St & 6 Ave” stands out as the most frequent starting station, indicating a significant volume of trips beginning at this location compared to all others. The chart clearly ranks the top 10 stations by the number of originating trips, showing a decreasing volume from the first to the tenth station on the list. The presence of these specific stations in the top 10 suggests they are situated in key areas with high demand for bike rentals, likely due to factors such as proximity to transit, commercial areas, residential density, or popular destinations, making them critical hubs for the start of many journeys within the bike-sharing network.
Show code
library(dplyr)library(ggplot2)# Top 10 most frequent start stationstop_start_stations <- citibike_data_cleaned %>%filter(!is.na(start_station_name)) %>%count(start_station_name, sort =TRUE) %>%slice_max(n, n =10)# Plot for start stationsggplot(top_start_stations, aes(x =reorder(start_station_name, n), y = n, fill = start_station_name)) +geom_col(show.legend =FALSE) +coord_flip() +labs(title ="Top 10 start stations",x ="Start station",y ="Number of trips") +theme_minimal()
Show code
# Top 10 most frequent end stationstop_end_stations <- citibike_data_cleaned %>%filter(!is.na(end_station_name)) %>%count(end_station_name, sort =TRUE) %>%slice_max(n, n =10)# Plot for end stationsggplot(top_end_stations, aes(x =reorder(end_station_name, n), y = n, fill = end_station_name)) +geom_col(show.legend =FALSE) +coord_flip() +labs(title ="Top 10 end stations",x ="End station",y ="Number of trips") +theme_minimal()
6 Conclusions
Based on the analysis of Citi Bike usage data for 2024, the following conclusions can be drawn:
Seasonality significantly impacts ridership, with Summer experiencing the highest number of trips, followed by Autumn and Spring. Winter consistently shows the lowest ridership, which is an expected pattern due to colder weather and shorter days. Average trip duration is also longer in warmer months (Summer, Spring, Fall) compared to Winter.
User type reveals distinct usage patterns. Registered members constitute the majority of users and account for a substantially higher number of trips overall. Members primarily use the bikes for regular, shorter trips, likely for commuting or errands during weekdays, particularly during morning and evening peak hours. Casual users, while taking fewer trips, tend to have significantly longer average trip durations and use the service more for leisure activities on weekends and throughout the afternoon and early evening.
Weather conditions have a clear impact on usage. Clear weather days see a disproportionately higher volume of rides compared to their frequency, indicating they are “Over-used”. Conversely, days with rain or snow result in significantly lower ridership than expected based on the frequency of these conditions, indicating that users ride less when it is rainy or snowy. This suggests a negative impact of precipitation on Citi Bike usage.
Electric bikes are more popular in terms of trip volume, being used for a significantly higher number of trips compared to classic bikes. However, the average trip duration is slightly longer for classic bikes.
Geographical location of stations matters, with certain “hot spots” or areas showing a high concentration of trip origins, likely corresponding to areas with high demand such as business districts, residential areas, or transportation hubs. “W 21 St & 6 Ave” was identified as the most frequent starting station.