Data Processing & Cleaning

Author

Antonio Flores

Published

July 2, 2024

library(tidyr)# For simplifying data cleaning
Warning: package 'tidyr' was built under R version 4.3.3
library(here) #For reading in data from absolute path
here() starts at C:/Users/Client/Documents/antonioflores-P2-portfolio
library(dplyr)# For Piping data
Warning: package 'dplyr' was built under R version 4.3.3

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2) # EDA Charts/Graphs
Warning: package 'ggplot2' was built under R version 4.3.3
#Reading in the Data
datalocation = here("cdcdata-exercise", "data", "Center_for_Medicare___Medicaid_Services__CMS____Medicare_Claims_data_20240701.csv")
data = read.csv(datalocation)

The dataset comes from the CDC’s data repository and can be accessed with this link. Options exist for exporting via CSV or through the use of an API.

Heart Disease Stroke Prevention

The Center for Medicare & Medicaid Services (CMS) gathers large amounts of health data from Medicare/Medicaid patients. From this data source, indicators have been computed by the CDC’s Division of Heart Disease and Stroke Prevention (DHDSP), and that is the dataset that will be used for this activity.

Before Cleaning: Each row contains a single data value for a particular category of a demographic (which is one of Race, Gender, Age) for each topic (Heart Failure, Stroke, etc), for each state, for each year(2016-2021).

In other words, for each year, there is a row for each state, for each state there is a row for each topic (5), for every topic there is a row for either percentage or rate, for each of these rows, there is a row for either gender, age, or race, and for each demographic there is one row per option (Male, Female, over75, etc.)

str(data)
'data.frame':   33454 obs. of  30 variables:
 $ RowId                     : logi  NA NA NA NA NA NA ...
 $ YearStart                 : int  2016 2017 2018 2019 2020 2021 2017 2021 2019 2018 ...
 $ LocationAbbr              : chr  "US" "US" "US" "US" ...
 $ LocationDesc              : chr  "United States" "United States" "United States" "United States" ...
 $ DataSource                : chr  "Medicare" "Medicare" "Medicare" "Medicare" ...
 $ PriorityArea1             : chr  "None" "None" "None" "None" ...
 $ PriorityArea2             : logi  NA NA NA NA NA NA ...
 $ PriorityArea3             : chr  "None" "None" "None" "None" ...
 $ PriorityArea4             : logi  NA NA NA NA NA NA ...
 $ Class                     : chr  "Cardiovascular Diseases" "Cardiovascular Diseases" "Cardiovascular Diseases" "Cardiovascular Diseases" ...
 $ Topic                     : chr  "Major Cardiovascular Disease" "Major Cardiovascular Disease" "Major Cardiovascular Disease" "Major Cardiovascular Disease" ...
 $ Question                  : chr  "Major cardiovascular disease hospitalization rate among Medicare Fee-For-Service beneficiaries (65+); CMS Medicare" "Major cardiovascular disease hospitalization rate among Medicare Fee-For-Service beneficiaries (65+); CMS Medicare" "Major cardiovascular disease hospitalization rate among Medicare Fee-For-Service beneficiaries (65+); CMS Medicare" "Major cardiovascular disease hospitalization rate among Medicare Fee-For-Service beneficiaries (65+); CMS Medicare" ...
 $ Data_Value_Type           : chr  "Crude" "Crude" "Crude" "Crude" ...
 $ Data_Value_Unit           : chr  "Rate per 100,000" "Rate per 100,000" "Rate per 100,000" "Rate per 100,000" ...
 $ Data_Value                : num  2368 2404 2482 5395 4505 ...
 $ Data_Value_Alt            : num  2368 2404 2482 5395 4505 ...
 $ Data_Value_Footnote_Symbol: logi  NA NA NA NA NA NA ...
 $ Data_Value_Footnote       : logi  NA NA NA NA NA NA ...
 $ Low_Confidence_Limit      : num  2318 2358 2439 5382 4493 ...
 $ High_Confidence_Limit     : num  2418 2450 2526 5408 4518 ...
 $ Break_Out_Category        : chr  "Race" "Race" "Race" "Gender" ...
 $ Break_Out                 : chr  "Unknown" "Unknown" "Unknown" "Male" ...
 $ ClassId                   : chr  "C1" "C1" "C1" "C1" ...
 $ TopicId                   : chr  "T1" "T1" "T1" "T1" ...
 $ QuestionId                : chr  "MD101" "MD101" "MD101" "MD101" ...
 $ Data_Value_TypeID         : chr  "Crude" "Crude" "Crude" "Crude" ...
 $ BreakOutCategoryId        : chr  "BOC04" "BOC04" "BOC04" "BOC02" ...
 $ BreakOutId                : chr  "RAC08" "RAC08" "RAC08" "GEN01" ...
 $ LocationId                : int  59 59 59 59 59 59 59 59 30 59 ...
 $ GeoLocation               : chr  "" "" "" "" ...
summary(data)
  RowId           YearStart    LocationAbbr       LocationDesc      
 Mode:logical   Min.   :2016   Length:33454       Length:33454      
 NA's:33454     1st Qu.:2017   Class :character   Class :character  
                Median :2018   Mode  :character   Mode  :character  
                Mean   :2018                                        
                3rd Qu.:2020                                        
                Max.   :2021                                        
  DataSource        PriorityArea1      PriorityArea2  PriorityArea3     
 Length:33454       Length:33454       Mode:logical   Length:33454      
 Class :character   Class :character   NA's:33454     Class :character  
 Mode  :character   Mode  :character                  Mode  :character  
                                                                        
                                                                        
                                                                        
 PriorityArea4     Class              Topic             Question        
 Mode:logical   Length:33454       Length:33454       Length:33454      
 NA's:33454     Class :character   Class :character   Class :character  
                Mode  :character   Mode  :character   Mode  :character  
                                                                        
                                                                        
                                                                        
 Data_Value_Type    Data_Value_Unit      Data_Value     Data_Value_Alt  
 Length:33454       Length:33454       Min.   :   0.0   Min.   :   0.0  
 Class :character   Class :character   1st Qu.:   5.1   1st Qu.:   5.1  
 Mode  :character   Mode  :character   Median :  25.0   Median :  25.0  
                                       Mean   : 845.1   Mean   : 845.1  
                                       3rd Qu.: 965.5   3rd Qu.: 965.5  
                                       Max.   :8118.2   Max.   :8118.2  
 Data_Value_Footnote_Symbol Data_Value_Footnote Low_Confidence_Limit
 Mode:logical               Mode:logical        Min.   :  -1.0      
 NA's:33454                 NA's:33454          1st Qu.:   3.5      
                                                Median :   8.3      
                                                Mean   : 742.4      
                                                3rd Qu.: 825.9      
                                                Max.   :7925.8      
 High_Confidence_Limit Break_Out_Category  Break_Out           ClassId         
 Min.   :   -1.0       Length:33454       Length:33454       Length:33454      
 1st Qu.:    7.2       Class :character   Class :character   Class :character  
 Median :   92.5       Mode  :character   Mode  :character   Mode  :character  
 Mean   : 1027.8                                                               
 3rd Qu.: 1212.2                                                               
 Max.   :18238.6                                                               
   TopicId           QuestionId        Data_Value_TypeID  BreakOutCategoryId
 Length:33454       Length:33454       Length:33454       Length:33454      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  BreakOutId          LocationId    GeoLocation       
 Length:33454       Min.   : 1.00   Length:33454      
 Class :character   1st Qu.:17.00   Class :character  
 Mode  :character   Median :29.00   Mode  :character  
                    Mean   :29.56                     
                    3rd Qu.:44.00                     
                    Max.   :59.00                     
head(data)
  RowId YearStart LocationAbbr  LocationDesc DataSource PriorityArea1
1    NA      2016           US United States   Medicare          None
2    NA      2017           US United States   Medicare          None
3    NA      2018           US United States   Medicare          None
4    NA      2019           US United States   Medicare          None
5    NA      2020           US United States   Medicare          None
6    NA      2021           US United States   Medicare          None
  PriorityArea2 PriorityArea3 PriorityArea4                   Class
1            NA          None            NA Cardiovascular Diseases
2            NA          None            NA Cardiovascular Diseases
3            NA          None            NA Cardiovascular Diseases
4            NA          None            NA Cardiovascular Diseases
5            NA          None            NA Cardiovascular Diseases
6            NA          None            NA Cardiovascular Diseases
                         Topic
1 Major Cardiovascular Disease
2 Major Cardiovascular Disease
3 Major Cardiovascular Disease
4 Major Cardiovascular Disease
5 Major Cardiovascular Disease
6 Major Cardiovascular Disease
                                                                                                            Question
1 Major cardiovascular disease hospitalization rate among Medicare Fee-For-Service beneficiaries (65+); CMS Medicare
2 Major cardiovascular disease hospitalization rate among Medicare Fee-For-Service beneficiaries (65+); CMS Medicare
3 Major cardiovascular disease hospitalization rate among Medicare Fee-For-Service beneficiaries (65+); CMS Medicare
4 Major cardiovascular disease hospitalization rate among Medicare Fee-For-Service beneficiaries (65+); CMS Medicare
5 Major cardiovascular disease hospitalization rate among Medicare Fee-For-Service beneficiaries (65+); CMS Medicare
6 Major cardiovascular disease hospitalization rate among Medicare Fee-For-Service beneficiaries (65+); CMS Medicare
  Data_Value_Type  Data_Value_Unit Data_Value Data_Value_Alt
1           Crude Rate per 100,000     2367.6         2367.6
2           Crude Rate per 100,000     2403.7         2403.7
3           Crude Rate per 100,000     2482.3         2482.3
4           Crude Rate per 100,000     5395.0         5395.0
5           Crude Rate per 100,000     4505.4         4505.4
6           Crude Rate per 100,000     4694.6         4694.6
  Data_Value_Footnote_Symbol Data_Value_Footnote Low_Confidence_Limit
1                         NA                  NA               2318.3
2                         NA                  NA               2358.1
3                         NA                  NA               2438.9
4                         NA                  NA               5381.7
5                         NA                  NA               4493.3
6                         NA                  NA               4682.0
  High_Confidence_Limit Break_Out_Category Break_Out ClassId TopicId QuestionId
1                2417.9               Race   Unknown      C1      T1      MD101
2                2450.2               Race   Unknown      C1      T1      MD101
3                2526.4               Race   Unknown      C1      T1      MD101
4                5408.2             Gender      Male      C1      T1      MD101
5                4517.5             Gender      Male      C1      T1      MD101
6                4707.2             Gender      Male      C1      T1      MD101
  Data_Value_TypeID BreakOutCategoryId BreakOutId LocationId GeoLocation
1             Crude              BOC04      RAC08         59            
2             Crude              BOC04      RAC08         59            
3             Crude              BOC04      RAC08         59            
4             Crude              BOC02      GEN01         59            
5             Crude              BOC02      GEN01         59            
6             Crude              BOC02      GEN01         59            

Data Cleaning

First, we’re just changing some fields that were mislabeled (e.g., Arizona Abbreviation was AR), converting categorical variables to factors, and removing useless columns)

cleaned_data = data %>% 
  mutate(LocationAbbr =replace(LocationAbbr, LocationDesc=="Arizona", "AZ")) %>% 
  mutate(LocationAbbr =replace(LocationAbbr, LocationDesc=="Maryland", "MD")) %>% 
  mutate(LocationAbbr = as.factor(LocationAbbr)) %>% 
  mutate(LocationDesc = as.factor(LocationDesc)) %>% 
  mutate(DataSource = as.factor(DataSource)) %>% 
  mutate(Class = as.factor(Class)) %>% 
  mutate(Topic = as.factor(Topic)) %>% 
  mutate(Question = as.factor(Question)) %>% 
  mutate(Data_Value_Type = as.factor(Data_Value_Type)) %>%
  mutate(Data_Value_Unit = as.factor(Data_Value_Unit)) %>% 
  mutate(Data_Value_Unit = as.factor(Data_Value_Unit)) %>% 
  mutate(Break_Out_Category = as.factor(Break_Out_Category)) %>% 
  mutate(Break_Out = as.factor(Break_Out)) %>% 
  select(!DataSource) %>% 
  select(!RowId) %>% 
  select(!Class) %>% 
  select(!Data_Value_Type) %>% 
  select(!Data_Value_Footnote_Symbol) %>% 
  select(!Data_Value_Footnote) %>% 
  select(!ClassId:GeoLocation) %>% 
  select(!PriorityArea1:PriorityArea4)

Next, in order to better assess the data, I tried to widen out the data so there would only be one line for every topic per state per year.

(There was probably a way more efficient way to do this, but I couldn’t find it)

#Breaking into three data frames for each main social demographic
#Using pivot wider to turn demographic options into individual columns

###Race
racedata = cleaned_data %>% 
  filter(Break_Out_Category=="Race") %>% 
  select(YearStart, LocationAbbr, Data_Value_Unit, Topic, Data_Value, Break_Out_Category, Break_Out)

longer_race = racedata %>% 
  pivot_wider(names_from = Break_Out, values_from = Data_Value) %>% 
  arrange(LocationAbbr,YearStart, Topic, Data_Value_Unit) 
  
longer_race$ID = seq.int(nrow(longer_race))

####Gender
genderdata = cleaned_data %>% 
  filter(Break_Out_Category=="Gender") %>% 
  select(YearStart, LocationAbbr, Data_Value_Unit, Topic, Data_Value, Break_Out_Category, Break_Out)

longer_gender = genderdata %>% 
  pivot_wider(names_from = Break_Out, values_from = Data_Value) %>% 
  arrange(LocationAbbr,YearStart, Topic, Data_Value_Unit) 

longer_gender$ID = seq.int(nrow(longer_gender))

###Age

agedata = cleaned_data %>% 
  filter(Break_Out_Category=="Age") %>% 
  select(YearStart, LocationAbbr, Data_Value_Unit, Topic, Data_Value, Break_Out_Category, Break_Out)

longer_age = agedata %>% 
  pivot_wider(names_from = Break_Out, values_from = Data_Value) %>% 
  arrange(LocationAbbr,YearStart, Topic, Data_Value_Unit) 

longer_age$ID = seq.int(nrow(longer_age))


###Need to join these data frames together
all_long1 = left_join(longer_race, longer_gender, by = 'ID')

all_long = left_join(all_long1, longer_age, by = 'ID')

Just removing deuplicate columns

#selecting only non-duplicate columns
clean_all = all_long %>% 
  select(YearStart.x:Topic.x,Unknown:ID, Male:Female,`75+`)

Trying to widen out the data some more with percent and rate data

percent_date = clean_all %>% 
  filter(Data_Value_Unit.x =="Percent (%)")
  
percent_date$Second_ID = seq.int(nrow(percent_date))

rate_data = clean_all %>% 
  filter(Data_Value_Unit.x=="Rate per 100,000")
  
rate_data$Second_ID = seq.int(nrow(rate_data))

rate_percent_data = left_join(percent_date, rate_data, by = "Second_ID")
  
#again removing uneeded columns
rate_percent_data1 = rate_percent_data %>%   
  select(YearStart.x.x:LocationAbbr.x.x, Topic.x.x:`Non-Hispanic Black.x`, Male.x:Second_ID, Unknown.y:`Non-Hispanic Black.y`, Male.y:`75+.y`)

Finally, renaming all remaining columns

good_data = rate_percent_data1 %>% 
  rename(Year = YearStart.x.x,
         Location = LocationAbbr.x.x,
         Topic = Topic.x.x,
         Unknown_pct = Unknown.x,
         Non_Hispanic_Asian_pct = `Non-Hispanic Asian.x`,
         Non_Hispanic_White_pct= `Non-Hispanic White.x`,
         Hispanic_pct = Hispanic.x,
         Other_pct = Other.x,
         Non_Hispanic_Black_pct = `Non-Hispanic Black.x`,
         Male_pct = Male.x,
         Female_pct = Female.x,
         Is75plus_pct = `75+.x`,
         RowID = Second_ID,
         Unknown_rate = Unknown.y,
         Non_Hispanic_Asian_rate = `Non-Hispanic Asian.y`,
         Non_Hispanic_White_rate= `Non-Hispanic White.y`,
         Hispanic_rate = Hispanic.y,
         Other_rate = Other.y,
         Non_Hispanic_Black_rate = `Non-Hispanic Black.y`,
         Male_rate = Male.y,
         Female_rate = Female.y,
         Is75plus_rate = `75+.y`
         )

At this time, our data now has the following structure:

There 36 rows for each state/territory (52), consisting of 6 rows per year (6 years), one for each topic.

summary(good_data)
      Year         Location                                           Topic    
 Min.   :2016   AK     :  36   Acute Myocardial Infarction (Heart Attack):312  
 1st Qu.:2017   AL     :  36   Coronary Heart Disease                    :312  
 Median :2018   AR     :  36   Diseases of the Heart (Heart Disease)     :312  
 Mean   :2018   AZ     :  36   Heart Failure                             :312  
 3rd Qu.:2020   CA     :  36   Major Cardiovascular Disease              :312  
 Max.   :2021   CO     :  36   Stroke                                    :312  
                (Other):1656                                                   
  Unknown_pct      Non_Hispanic_Asian_pct Non_Hispanic_White_pct
 Min.   :  0.000   Min.   :  0.000        Min.   : 0.00         
 1st Qu.:  0.000   1st Qu.:  0.000        1st Qu.: 4.40         
 Median :  2.800   Median :  5.850        Median : 5.10         
 Mean   :  3.895   Mean   :  7.006        Mean   : 5.42         
 3rd Qu.:  4.900   3rd Qu.:  9.100        3rd Qu.: 6.30         
 Max.   :100.000   Max.   :100.000        Max.   :13.30         
 NA's   :32        NA's   :86                                   
  Hispanic_pct       Other_pct       Non_Hispanic_Black_pct    Male_pct     
 Min.   :  0.000   Min.   :  0.000   Min.   :  0.000        Min.   : 0.000  
 1st Qu.:  0.000   1st Qu.:  3.400   1st Qu.:  3.000        1st Qu.: 4.400  
 Median :  3.800   Median :  5.700   Median :  5.000        Median : 5.100  
 Mean   :  5.193   Mean   :  6.422   Mean   :  5.273        Mean   : 5.389  
 3rd Qu.:  7.100   3rd Qu.:  8.300   3rd Qu.:  6.400        3rd Qu.: 6.200  
 Max.   :100.000   Max.   :100.000   Max.   :100.000        Max.   :15.700  
 NA's   :138       NA's   :11        NA's   :47                             
   Female_pct      Is75plus_pct        RowID         Unknown_rate   
 Min.   : 0.000   Min.   : 0.000   Min.   :   1.0   Min.   :   0.0  
 1st Qu.: 4.400   1st Qu.: 5.000   1st Qu.: 468.8   1st Qu.: 346.6  
 Median : 5.300   Median : 6.100   Median : 936.5   Median : 555.2  
 Mean   : 5.614   Mean   : 6.395   Mean   : 936.5   Mean   : 937.1  
 3rd Qu.: 6.700   3rd Qu.: 7.500   3rd Qu.:1404.2   3rd Qu.:1669.2  
 Max.   :13.600   Max.   :15.900   Max.   :1872.0   Max.   :3857.4  
                                                                    
 Non_Hispanic_Asian_rate Non_Hispanic_White_rate Hispanic_rate   
 Min.   :   0.0          Min.   :  61.8          Min.   :   0.0  
 1st Qu.: 445.8          1st Qu.: 645.1          1st Qu.: 440.3  
 Median : 817.1          Median : 955.3          Median : 852.6  
 Mean   :1254.5          Mean   :1748.8          Mean   :1423.6  
 3rd Qu.:2054.8          3rd Qu.:3185.7          3rd Qu.:2283.8  
 Max.   :5405.4          Max.   :6083.1          Max.   :7462.7  
                                                                 
   Other_rate     Non_Hispanic_Black_rate   Male_rate       Female_rate    
 Min.   :   0.0   Min.   :   0.0          Min.   :  53.1   Min.   :  45.8  
 1st Qu.: 610.3   1st Qu.: 607.8          1st Qu.: 784.3   1st Qu.: 497.6  
 Median : 955.4   Median :1177.4          Median :1125.1   Median : 835.8  
 Mean   :1639.8   Mean   :2005.0          Mean   :2006.4   Mean   :1553.9  
 3rd Qu.:2716.2   3rd Qu.:3533.8          3rd Qu.:3618.7   3rd Qu.:2821.6  
 Max.   :6491.8   Max.   :7679.1          Max.   :6436.4   Max.   :5799.6  
                                                                           
 Is75plus_rate 
 Min.   :  77  
 1st Qu.: 829  
 Median :1309  
 Mean   :2449  
 3rd Qu.:4524  
 Max.   :8118  
               

Exploratory Data Analysis

The main variables I am exploring here are 1) Location 2) Year 3) Topic 4) Hispanic_pct 5) Hispanic_rate (Rate per 100,000)

Hispanic Rate Over Time, colored by Topic

  good_data %>% ggplot(aes(x=Year, y=Hispanic_rate, color = Topic)) + geom_point()

Hispanic Rate Boxplot

boxplot(good_data$Hispanic_rate)

Exploring the Quartiles for Hispanic Rate/Percentage

summary(good_data$Hispanic_rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   440.3   852.6  1423.6  2283.8  7462.7 
summary(good_data$Hispanic_pct)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   3.800   5.193   7.100 100.000     138 

Hispanic Rate Distributions across Different Topics

ggplot(good_data, aes(Hispanic_rate))+
  geom_boxplot()+
  facet_wrap(.~Topic)+
  coord_flip()

Hispanic Rate Distributions across Years

ggplot(data=good_data)+
  geom_boxplot(mapping =aes(Hispanic_rate))+
  facet_wrap(.~Year)+
  coord_flip()

Mean Hispanic Rate across Topic

good_data %>% 
  group_by(Topic) %>% 
  summarise(Mean = mean(Hispanic_rate, na.rm=TRUE)) %>% 
  ggplot(aes(x=reorder(Topic, Mean), y=Mean))+
    geom_bar(stat = "identity")+
  coord_flip()

Mean Hispanic Percent across the Topic

good_data %>% 
  group_by(Topic) %>% 
  summarise(Mean = mean(Hispanic_pct, na.rm=TRUE)) %>% 
  ggplot(aes(x=reorder(Topic, Mean), y=Mean))+
    geom_bar(stat = "identity")+
  coord_flip()

Hispanic Rate across Location, colored by Topic

good_data %>% 
  group_by(Location, Topic) %>% 
  summarise(Mean = mean(Hispanic_rate, na.rm=TRUE)) %>% 
  ggplot(aes(x=reorder(Location, Mean), y=Mean, fill=Topic))+
    geom_bar(stat = "identity")+
  coord_flip()
`summarise()` has grouped output by 'Location'. You can override using the
`.groups` argument.