Upon examining the dataset, I’ve noticed that a significant challenge is the presence of incomplete data. Some entries are blank, while others are marked as “not_available.”
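As a minimal sketch of a first step, both kinds of gap can be mapped to one missing marker before counting them per column. This assumes the data is available as CSV text. The sample rows and the column names `name`, `age`, and `race` are hypothetical and used only for illustration.

```python
import csv
import io

# Markers treated as missing: "not_available" comes from the dataset
# description above, and the empty string covers blank cells.
MISSING_MARKERS = {"", "not_available"}


def load_rows(csv_text):
    """Parse CSV text and replace every missing marker with None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for raw in reader:
        cleaned = {}
        for key, value in raw.items():
            if value is None or value.strip() in MISSING_MARKERS:
                cleaned[key] = None
            else:
                cleaned[key] = value.strip()
        rows.append(cleaned)
    return rows


def missing_counts(rows):
    """Count the missing entries in each column."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (1 if value is None else 0)
    return counts


# Hypothetical sample, not taken from the real dataset.
sample = "name,age,race\nA,34,not_available\nB,,W\nC,51,B\n"
rows = load_rows(sample)
print(missing_counts(rows))  # {'name': 0, 'age': 1, 'race': 1}
```

Counting the gaps per column first makes it easier to judge which of the strategies is reasonable for each column.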
I’ve pinpointed several strategies to address incomplete data:
- Eliminate Rows or Columns: This strategy is effective when there are only a few missing values that are scattered randomly and don’t significantly affect the overall analysis.
- Value Imputation: This involves substituting missing entries with predicted or estimated values. Widely used techniques include:
  - Average, Median, or Mode Imputation: Filling in missing entries with the average, median, or mode of the respective column.
  - Linear Regression Imputation: Leveraging other variables to estimate and replace missing values.
  - Interpolation: Estimating values based on adjacent data points, particularly useful for time-series datasets.
  - K-Nearest Neighbors (KNN): Substituting missing values with values from rows that are similar based on other attributes.
  - MICE (Multiple Imputation by Chained Equations): A sophisticated technique that takes into account the interdependencies between variables.
- Classify Missing Values: For categorical data, introducing a new category like “Unknown” or “N/A” for missing values can be insightful.
- Avoid Imputation and Treat Missingness as Its Own Category: Sometimes the absence of data is itself meaningful. In that case, it can be more informative to leave the gaps unfilled and analyze the missing entries as a separate group.
- Employ Advanced Statistical Methods: In intricate analyses, sophisticated techniques such as the Expectation-Maximization (EM) algorithms or structural equation modeling might be required to address incomplete data.
- Data Validation Protocols: Implementing data validation rules in Excel can help prevent incomplete or incorrect data from being entered in the future.
I plan to discuss this with the course instructor and teaching assistants to decide on the best method for our dataset.
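A few of the simpler strategies listed above can be sketched in plain Python. This is only an illustration under stated assumptions: missing entries are represented as `None`, the helper names are hypothetical, and the sample values are made up rather than taken from the dataset. Regression, KNN, MICE, and EM are not shown because they normally rely on a dedicated statistics library.

```python
from statistics import median, mode


def drop_missing(values):
    """Deletion: keep only the entries that are present."""
    return [v for v in values if v is not None]


def impute_constant(values, fill):
    """Replace each missing entry with a fixed value, e.g. "Unknown"."""
    return [fill if v is None else v for v in values]


def impute_median(values):
    """Median imputation for a numeric column."""
    return impute_constant(values, median(drop_missing(values)))


def impute_mode(values):
    """Mode imputation, usable for categorical columns."""
    return impute_constant(values, mode(drop_missing(values)))


def interpolate_linear(values):
    """Fill interior gaps linearly between the nearest known neighbours.

    Leading and trailing gaps have no neighbour on one side and stay None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


# Made-up sample columns for illustration only.
ages = [30, None, 40, 50, None]
races = ["W", None, "B", "W"]

print(impute_median(ages))               # [30, 40, 40, 50, 40]
print(interpolate_linear(ages))          # [30, 35.0, 40, 50, None]
print(impute_mode(races))                # ['W', 'W', 'B', 'W']
print(impute_constant(races, "Unknown")) # ['W', 'Unknown', 'B', 'W']
```

The comparison shows the trade-off: median and mode imputation always produce a complete column but pull values toward the centre, while interpolation only makes sense when neighbouring rows are ordered in a meaningful way.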
Regarding dataset 1 (fatal-police-shootings-data), I began with a preliminary statistical analysis:
- Latitude: central value 36.08117181, minimum 19.4975033, maximum 71.3012553, variability 5.346323915.
- Longitude: central value 36.08117181, minimum -9.00718E+15, maximum -67.8671657, variability 1.02104E+14.
- Age: central value 35, minimum 2, maximum 92, variability 12.99.
These figures hint at potential anomalies in the dataset. The longitude minimum of -9.00718E+15 lies far outside the valid range of -180 to 180 degrees, so at least one record appears to have a corrupted coordinate, and that value also inflates the longitude variability. The central longitude is identical to the central latitude, which looks like a copy error that I need to recheck. For age, the range runs from 2 to 92. Individuals that young or that old being involved in police-related shootings is unusual, so these records are worth verifying. If they are accurate, they might suggest accidental discharges or other errors on the part of law enforcement.
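The summary figures and a basic range check can be reproduced with a short sketch like the one below. It assumes that "variability" means the sample standard deviation and reports both the mean and the median, since the text does not say which central value was used. The helper names are hypothetical, and the sample list is illustrative: only the two longitude extremes quoted above come from the dataset.

```python
from statistics import mean, median, stdev


def summarize(values):
    """Return basic descriptive statistics, skipping missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "mean": mean(present),
        "median": median(present),
        "min": min(present),
        "max": max(present),
        "stdev": stdev(present),  # sample standard deviation
    }


def out_of_range(values, low, high):
    """Return the entries that fall outside the closed interval [low, high]."""
    return [v for v in values if v is not None and not (low <= v <= high)]


# Illustrative values; only -9.00718e15 and -67.8671657 are quoted from the text.
longitudes = [-118.2437, -9.00718e15, -67.8671657, None, -87.6298]

# Valid longitudes lie between -180 and 180 degrees.
bad = out_of_range(longitudes, -180.0, 180.0)
clean = [v for v in longitudes if v is not None and v not in bad]

print(bad)
print(summarize(clean))
```

Running the summary before and after removing the out-of-range value shows how strongly a single corrupted coordinate can distort the minimum and the standard deviation.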
The agency linked with the greatest number of police-related shootings is agency 38, the “Los Angeles Police Department,” with a total of 129 cases.
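A count like this can be obtained by tallying the agency identifiers across all records. The sketch below is an assumption-laden illustration: the function name is hypothetical, the sample values are made up, and the handling of several identifiers in one field separated by `;` is an assumption about the file layout that should be checked against the actual data.

```python
from collections import Counter


def count_agencies(agency_fields, separator=";"):
    """Tally agency ids, skipping missing fields.

    Assumption: a field may hold several ids joined by `separator`.
    """
    counts = Counter()
    for field in agency_fields:
        if not field:
            continue  # None or empty string: no agency recorded
        for agency_id in field.split(separator):
            agency_id = agency_id.strip()
            if agency_id:
                counts[agency_id] += 1
    return counts


# Made-up sample for illustration only.
sample = ["38", "38;102", None, "7", "38", ""]
counts = count_agencies(sample)
top_id, top_count = counts.most_common(1)[0]
print(top_id, top_count)  # 38 3
```

Mapping the top identifier back to a name such as "Los Angeles Police Department" would require a separate lookup table of agencies.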