how to clean the data in r

Following is a process for data cleaning, standard tools that are used to explore and discover characteristics that need to be cleaned and tools that are used to transform data into that cleaned state.

The dataset used in this demo came from this excellent course in data cleaning hosted on Datacamp

Before diving in, figure out the basic structure of the data. Look at things like the class() of the data object to make sure it is what we expect (usually a data.frame). Check the dimensions with dim() and the column names with names()

Look at summaries of the data. Functions str() , glimpse() , and summary() each give insight into the data.

Functions head() and tail() allow you to view the top and bottom rows of the data.

This weather dataset suffers from one of the five most common symptoms of messy data: column names that are values. Column names X1-X31 represent days of the month. They should be values of a variable called day.

The tidyr package provides the gather() function for this scenario.

gather(df, time, val, t1:t3)

gather() allows you to select multiple columns to be gathered by using the : operator.

                                                 weather2                                                 <-                                                 gather                (                                weather                ,                                                                 day                ,                                                                 value                ,                                                                 X1                :                X31                ,                                                                 na.rm                                                 =                                                 TRUE                )

Values are variable names

Our data suffer from a second common symptom of messy data: values are variable names. Specifically, values in the measure column should be variables (i.e. column names) in our dataset.

The spread() function from tidyr is designed to help with this. To remind you of how this function works, we've loaded another small dataset called df2 (which is the result of applying gather() to the original df from last exercise). Give the following a try before attempting the instructions below.

spread(df2, time, val)

Note how the values of the time column now become column names. The tidyr package is already loaded.

                                                 weather2                                                 <-                                                 weather2                [                                ,                                                                 -                1                ]

                                                 weather3                                                 <-                                                 spread                (                                weather2                ,                                                                 measure                ,                                                                 value                )

                                                 library                (                                tidyr                )                                                                 library                (                                dplyr                )                                                                 library                (                                stringr                )                                                                 library                (                                lubridate                )

                                                 weather3                $                day                                                 <-                                                 str_replace                (                                weather3                $                day                ,                                                                 "                X"                ,                                                                 "                "                )

                                                 weather4                                                 <-                                                 unite                (                                weather3                ,                                                                 date                ,                                                                 year                ,                                                                 month                ,                                                                 day                ,                                                                 sep                                                 =                                                 "                -"                )

                                                 weather4                $                date                                                 <-                                                 ymd                (                                weather4                $                date                )

                                                 weather5                                                 <-                                                 select                (                                weather4                ,                                                                 date                ,                                                                 Events                ,                                                                 CloudCover                :                WindDirDegrees                )

A closer look at column types

It's important for analysis that variables are coded appropriately. This is not yet the case with our weather data. Recall that functions such as as.numeric() and as.character() can be used to coerce variables into different types.

It's important to keep in mind that coercions are not always successful, particularly if there's some data in a column that you don't expect. For example, the following will cause problems:

as.numeric(c(4, 6.44, "some string", 222))

If you run the code above in the console, you'll get a warning message saying that R introduced an NA in the process of coercing to numeric. This is because it doesn't know how to make a number out of a string ("some string"). Watch out for this in our weather data!

                                                 as.numeric                (                                weather5                $                PrecipitationIn                )

As you saw in the last exercise, "T" was used to denote a trace amount (i.e. too small to be accurately measured) of precipitation in the PrecipitationIn column. In order to coerce this column to numeric, you'll need to deal with this somehow. To keep things simple, we will just replace "T" with zero, as a string ("0").

                                                                                                                   weather5                $                PrecipitationIn                                                 <-                                                 str_replace                (                                weather5                $                PrecipitationIn                ,                                                                 "                T"                ,                                                                 "                0"                )

                                                 weather6                                                 <-                                                 mutate_each                (                                weather5                ,                                                                 funs                (                                as.numeric                )                                ,                                                                 CloudCover                :                WindDirDegrees                )

Missing values in R should be represented by NA , but unclean datasets will have other values representing NA .

If missing values are properly coded as NA , the is.na() function will help you find them. You may need to try searching for some of the usual suspects like "" , "#N/A" , etc. You can also use the summary() and table() functions to turn up unexpected values in your data.

                                                 ind                                                 <-                                                 which                (                                is.na                (                                weather6                $                Max.Gust.SpeedMPH                )                                )

Besides missing values, we want to know if there are values in the data that are too extreme or bizarre to be plausible. A great way to start the search for these values is with summary().

Once implausible values are identified, they must be dealt with in an intelligent and informed way. Sometimes the best way forward is obvious and other times it may require some research and/or discussions with the original collectors of the data.

                                                 ind                                                 <-                                                 which                (                                weather6                $                Max.Humidity                                                 =                =                                                 1000                )

                                                 weather6                $                Max.Humidity                [                                ind                ]                                                                 <-                                                 100

                                                 summary                (                                weather6                $                Mean.VisibilityMiles                )

                                                 ind                                                 <-                                                 which                (                                weather6                $                Mean.VisibilityMiles                                                 =                =                                                 -                1                )

                                                 weather6                $                Mean.VisibilityMiles                [                                ind                ]                                                                 <-                                                 10

Check for other extreme values

In addition to dealing with obvious errors in the data, we want to see if there are other extreme values. In addition to the trusty summary() function, hist() is useful for quickly getting a feel for how different variables are distributed.

                                                       hist                  (                                    weather6                  $                  MeanDew.PointF                  )

                                                       hist                  (                                    weather6                  $                  Min.TemperatureF                  )

                                                       hist                  (                                    weather6                  $                  Mean.TemperatureF                  )

Before officially calling our weather data clean, we want to put a couple of finishing touches on the data. These are a bit more subjective and may not be necessary for analysis, but they will make the data easier for others to interpret, which is generally a good thing.

There are a number of stylistic conventions in the R language. Depending on who you ask, these conventions may vary. Because the period (.) has special meaning in certain situations, we generally recommend using underscores (_) to separate words in variable names. We also prefer all lowercase letters so that no one has to remember which letters are uppercase or lowercase.

Finally, the events column (renamed to be all lowercase in the first instruction) contains an empty string ("") for any day on which there was no significant weather event such as rain, fog, a thunderstorm, etc. However, if it's the first time you're seeing these data, it may not be obvious that this is the case, so it's best for us to be explicit and replace the empty strings with something more meaningful.

                new_colnames                                                 <-                                                 c                (                                "                date"                ,                                                                 "                events"                                                 ,                                                                 "                cloud_cover"                                                               ,                                "                max_dew_point_f"                                                                                            ,                                "                max_gust_speed_mph"                                                         ,                                "                max_humidity"                                                                                               ,                                "                max_sea_level_pressure_in"                                                  ,                                "                max_temperature_f"                                                                                          ,                                "                max_visibility_miles"                                                       ,                                "                max_wind_speed_mph"                                                                                         ,                                "                mean_humidity"                                                              ,                                "                mean_sea_level_pressure_in"                                                 ,                                "                mean_temperature_f"                                                         ,                                "                mean_visibility_miles"                                                                                      ,                                "                mean_wind_speed_mph"                                                        ,                                "                mean_dew_point_f"                                                                                           ,                                "                min_dew_point_f"                                                            ,                                "                min_humidity"                                                                                               ,                                "                min_sea_level_pressure_in"                                                  ,                                "                min_temperature_f"                                                                                          ,                                "                min_visibility_miles"                                                       ,                                "                precipitation_in"                                                                                           ,                                "                wind_dir_degrees"                )

                                                 names                (                                weather6                )                                                                 <-                                                 new_colnames

                                                 weather6                $                events                [                                weather6                $                events                                                 ==                                                 "                "                ]                                                                 <-                                                 "                None"