2 Data

2.1 Technical Description

2.1.1 About the data:

This is the New York City Rat Sightings Dataset, collected from NYC Open Data.

Link to Data Source: NYC Rat sightings dataset

The data comes from 311 service requests and we imported the data directly from the link above using NYC Open Data’s csv, downloaded onto our local system and then used it for the analysis.

2.1.2 Data update frequency on the website:

Everyday 9AM EST. This dataset contains various columns representing different aspects of rat sightings in New York City from 2010 to present day and is updated daily on the NYC Open Data Platform.

2.1.3 Descriptive stats about the data:

The data has 31 features with 232,536 rows. We have not detailed all the 31 features but instead only the ones which will be mainly used for our analysis. An overall description of the features is given in the data addendum section.

Created Date: The date and time when the sighting report was created. This will be important for a time series analysis of the dataset.
Location Type [loc_type]: The type of location where the sighting occurred. This categoric feature has 23 levels. Each level can be used to split the data for analyzing in-depth trends.
Incident Zip [zip]: The ZIP code where the sighting occurred. Used for geospatial analysis of the data.
Status [status]: The current status of the sighting report. This has 7 levels - Pending, Closed, Assigned, In-Progress, Open, Draft, Unspecified.
Borough [bor]: The borough in which the sighting occurred. This has 6 levels Queens, Brooklyn, Manhattan, Bronx, Staten Island and Unspecified.
Latitude [lat]: The latitude of the sighting location. Used for geo-spatial mapping of sightings.
Longitude [long]: The longitude of the sighting location. Used for geo-spatial mapping of sightings

2.2 Research Plan:

1. Breakdown of research objectives

Primary Objective:

Understand patterns and trends of rat sightings in New York City.

Secondary Objectives:

Investigating the seasonal and yearly trends in rat sightings.
Identifying hotspots for rat sightings.
Investigating trends around whether a rat sighting 311 request is closed.

2. Data Exploration

Primary features we will use for analysing this dataset:
- Created Date for time series analysis
- Incident Zip, Latitude, Longitude for geospatial analysis and plotting maps of rat sightings
- Location Type, Borough, Status for finding correlations

3. Descriptive Analysis

Initial Overview: Summary statistics and basic visualizations.
Temporal Analysis: Time-based trend analysis, segmented by categorical features like borough, location type.
Geographical Analysis: Map-based visualizations.
Correlation Analysis: Analysis of correlated categorical variables like created date, status, borough, location type.

2.3 Missing value analysis

2.3.1 Table of columns with missing values and their % missing out of the total data.

Code

library(dplyr)
library(tidyr)
library(ggplot2)

# Reading dataset
#"/Users/kaitlynbrown/Desktop/STAT5702/Rat_Sightings.csv"
data <- read.csv("/Users/kaitlynbrown/Desktop/STAT5702/Rat_Sightings.csv")
replace_blanks_with_na <- function(x) {
  if (is.character(x)) {
    return(ifelse(x == "", NA, x))
  } else if (is.integer(x)) {
    return(ifelse(x == 0, NA_integer_, x)) # Replace '0' with NA for integers
  } else {
    return(x)
  }
}
# replace every blank entry with NA
data <- data |> 
  mutate(across(everything(), replace_blanks_with_na))


#renaming columns which are essential for our analysis
cnames<- colnames(data)

#renaming these columns
data <- data |> 
rename(
    key = names(data)[1], 
    loc_type = names(data)[8], 
    zip = names(data)[9],
    add = names(data)[10],
    add_type = names(data)[16],
    city = names(data)[17],
    status = names(data)[20],
    due = names(data)[21],
    bor = names(data)[24],
    lat = names(data)[29],
    long = names(data)[30]
  )

# Total number of rows in the dataset
total_rows <- nrow(data)

# Calculate missing values count and percentage for each column
missing_values <- data |> 
  summarise(across(everything(), ~sum(is.na(.)))) |>
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "count_missing") |>
  mutate(percentage_missing = round((count_missing / total_rows) * 100, 2)) |>
  filter(count_missing > 0) |> # Filter out columns with no missing values
  arrange(desc(percentage_missing)) # Arrange in descending order of missing percentage
# Print the result
print(missing_values)

# A tibble: 23 × 3
   Feature               count_missing percentage_missing
   <chr>                         <int>              <dbl>
 1 Landmark                     138744              59.7 
 2 Intersection.Street.1        118273              50.9 
 3 Intersection.Street.2        118233              50.8 
 4 Facility.Type                107309              46.2 
 5 due                          100304              43.1 
 6 Cross.Street.1                26376              11.3 
 7 Cross.Street.2                26365              11.3 
 8 Closed.Date                   17364               7.47
 9 add                           10942               4.71
10 Street.Name                   10943               4.71
# ℹ 13 more rows

We plot a chart which shows all columns and their missing values:

Code

library(naniar)
library(tidyr)
library(ggplot2)
library(scales)

# Visualize missing values in each column
plot <- gg_miss_var(data)

# Modify the x-axis labels to display in the form of 1000's
plot + scale_y_continuous(labels = label_number_si()) +
  labs(
    title = "Missing Data Visualization for all columns",
    x = "Column Names",
    y = "Missing Values (in thousands)"
  )

In the next step, we drop the following columns:

Columns with more than 10% missing values
Columns which are non-essential to our analysis. The list of essential columns is given in the descriptive stats section of this page.

Revised plot of analysis essential columns which have less than 10% missing values:

Code

library(naniar)
library(tidyr)
library(ggplot2)
library(scales)

# Picking only columns with less than 10% missing values and those which are essential to our analysis.

columns_over_10_percent_missing <- missing_values %>%
  filter(percentage_missing > 10) %>%
  pull(Feature)

#remove columns with over 10% missing values
cleaned_data <- data |> 
  select(-all_of(columns_over_10_percent_missing))

# select only those columns relevant to the analysis
cleaned_data <- data |>
  select(1,8,10,16,17,20,24,29,30)


# Visualize missing values in each column
plot <- gg_miss_var(cleaned_data)

# Modify the x-axis labels to display in the form of 1000's
plot + scale_y_continuous(labels = label_number_si()) +
  labs(
    title = "Revised plot with essential columns having < 10% missing values",
    x = "Column Names",
    y = "Missing Values (in thousands)"
  )

Additional visuals on columns with missing data:

Using the Redav library for plotting missing values with % instead of absolute counts.

Code

library(redav)
library(dplyr)

plot_missing(cleaned_data, percent = TRUE)

Conclusion from missing data analysis:

As per our analysis - Address , Address type, City, Longitude and Latitude of rat sightings are the ones with the highest missing values (still under 10%). We cannot impute these missing values by any means and our final analysis removed rows which had these missing values.