How To Use Filter In R

Filtering Data with dplyr

Filtering data is one of the most basic operations when you work with data. You want to remove a part of the data that is invalid or that you're simply not interested in. Or, you want to zero in on a particular part of the data you want to know more about. Of course, dplyr has the 'filter()' function to do such filtering, but there is even more. With dplyr you can do the kind of filtering that would be hard to perform or complicated to construct with tools like SQL and traditional BI tools, in a simple and more intuitive way.

Let's begin with some simple ones. Again, I'll use the same flight data I imported in the previous post.

Select columns

First, let's select the columns that are interesting for now. If you want to know more about 'how to select columns' please check the post I have written earlier.

          library(dplyr)

          flight %>%
            select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME)

Filter with a value

Let's say you want to see only the flights of United Airlines (UA). You can run something like below.

          flight %>%            
            select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
            filter(CARRIER == "UA")

If you want to use the 'equal' operator you need to have two '=' (equal signs) together like above. Running the above returns only the rows where CARRIER is UA.

And now, let's find the flights that are of United Airlines (UA) and left San Francisco airport (SFO). You can use the '&' operator as AND and the '|' operator as OR to connect multiple filter conditions. This time we'll use '&'.

          flight %>%
            select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
            filter(CARRIER == "UA" & ORIGIN == "SFO")

Or, you might want to see only the flights that left San Francisco airport (SFO) but are not of United Airlines (UA). You can use the '!=' operator as 'not equal'.

          flight %>%            
            select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
            filter(CARRIER != "UA" & ORIGIN == "SFO")
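
The '|' operator mentioned earlier works the same way as '&'. Here is a minimal sketch that shows the two side by side. The 'toy' data frame and its values are made up for illustration; it is only a stand-in for the flight data, not the real thing.

```r
library(dplyr)

# Hypothetical stand-in for the flight data; the values are made up.
toy <- data.frame(
  CARRIER = c("UA", "AA", "DL", "UA"),
  ORIGIN  = c("SFO", "SFO", "JFK", "JFK"),
  stringsAsFactors = FALSE
)

# '|' keeps a row when at least one of the conditions is TRUE.
either <- toy %>%
  filter(CARRIER == "UA" | ORIGIN == "SFO")

# '&' keeps a row only when both conditions are TRUE.
both <- toy %>%
  filter(CARRIER == "UA" & ORIGIN == "SFO")

print(either)
print(both)
```

With this toy data, the '|' version keeps three rows (both UA flights plus the AA flight out of SFO), while the '&' version keeps only the one UA flight out of SFO.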

Filtering with multiple values

What if you want to see only the data for the flights that are of either United Airlines (UA) or American Airlines (AA)? You can use '%in%' for this, just like the IN operator in SQL.

          flight %>%            
            select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
            filter(CARRIER %in% c("UA", "AA"))

We can't really tell if it's working or not by looking at the first 10 rows. Let's run the count() function to summarize this quickly.

          flight %>%            
            select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
            filter(CARRIER %in% c("UA", "AA")) %>%
            count(CARRIER)

We can see only AA and UA as we expected. And yes, I know, this 'count()' function is amazing. It literally does what you would intuitively imagine. It returns the number of rows for each specified group, in this case CARRIER. We could have done this by using the 'group_by()' and 'summarize()' functions, but for something this simple the 'count()' function alone does the job quickly.
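
To make the comparison with 'group_by()' and 'summarize()' concrete, here is a small sketch that runs both forms. Again, the 'toy' data frame is a made-up stand-in for the flight data, so the numbers mean nothing beyond this example.

```r
library(dplyr)

# Hypothetical stand-in for the flight data; the values are made up.
toy <- data.frame(
  CARRIER = c("UA", "AA", "UA", "DL", "AA", "UA"),
  stringsAsFactors = FALSE
)

# Short form: count() groups by CARRIER and counts the rows in one step.
short_form <- toy %>%
  count(CARRIER)

# Long form: the same counts spelled out with group_by() and summarize().
long_form <- toy %>%
  group_by(CARRIER) %>%
  summarize(n = n())

print(short_form)
print(long_form)
```

Both forms give one row per carrier with the row count in a column named 'n'.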

Reverse the condition logic

What if you want to see the flights that are neither United Airlines (UA) nor American Airlines (AA) this time? It's actually very simple with R and dplyr. Here's a magic one letter you can use with any condition to reverse the result. It's '!' (exclamation mark). And, it goes like this.

          flight %>%            
            select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
            filter(!CARRIER %in% c("UA", "AA")) %>%            
            count(CARRIER)

Notice that there is an exclamation mark at the beginning of the condition inside the filter() function. This is a very handy operator that basically flips the result of the condition that comes after the exclamation mark. This is why the result above includes neither 'UA' nor 'AA'. It might look a bit weird until you get used to it, especially if you're coming from outside the R world, but you are going to see this a lot and will appreciate its power and convenience.
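
If the placement of '!' looks odd, it may help to see what it is actually flipping. The sketch below uses a made-up 'toy' data frame (an assumption for illustration, not the flight data) and shows that writing the condition with or without extra parentheses gives the same rows, because R applies '!' after '%in%'.

```r
library(dplyr)

# Hypothetical stand-in for the flight data; the values are made up.
toy <- data.frame(
  CARRIER = c("UA", "AA", "DL", "WN"),
  stringsAsFactors = FALSE
)

# %in% returns one TRUE/FALSE per row, and '!' flips each of them.
is_ua_or_aa <- toy$CARRIER %in% c("UA", "AA")
print(is_ua_or_aa)
print(!is_ua_or_aa)

# Both spellings mean the same thing, because '!' is applied after %in%.
without_parens <- toy %>% filter(!CARRIER %in% c("UA", "AA"))
with_parens    <- toy %>% filter(!(CARRIER %in% c("UA", "AA")))

print(without_parens)
```

Adding the parentheses is optional, but some people find that version easier to read.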

Filtering out NA values

Now, let's go back to the original data once more.

          flight %>%            
            select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME)

When you look closer you'd notice that there are some NA values in the ARR_DELAY column. You can get rid of them easily with the 'is.na()' function, which returns TRUE if the value is NA and FALSE otherwise.

          flight %>%            
            select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%            
            filter(is.na(ARR_DELAY))

Oops, it looks like all the values in ARR_DELAY are now NA, which is the opposite of what I hoped. Well, as you saw already, we can now try the '!' (exclamation mark) operator again like below.

          flight %>%            
            select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%            
            filter(!is.na(ARR_DELAY))

This is how you can work with NA values in terms of filtering the data.
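
You might wonder why we need 'is.na()' at all instead of comparing with '== NA'. Here is a minimal sketch of the difference, using a made-up 'toy' data frame (an assumption for illustration, not the flight data).

```r
library(dplyr)

# Hypothetical stand-in for the flight data; the values are made up.
toy <- data.frame(
  CARRIER   = c("UA", "AA", "DL"),
  ARR_DELAY = c(12, NA, -3),
  stringsAsFactors = FALSE
)

# Comparing anything with NA yields NA, not TRUE or FALSE.
print(toy$ARR_DELAY == NA)

# filter() keeps only the rows where the condition is TRUE,
# so a condition that is always NA returns no rows at all.
with_equals <- toy %>% filter(ARR_DELAY == NA)

# is.na() returns TRUE or FALSE for every row, so it behaves as expected.
only_na <- toy %>% filter(is.na(ARR_DELAY))
no_na   <- toy %>% filter(!is.na(ARR_DELAY))

print(only_na)
print(no_na)
```

The same rule explains a side effect worth knowing: any filter condition on a column drops the rows where that column is NA, because the condition evaluates to NA rather than TRUE for them.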

These are the basics of how 'filter' works with dplyr. But this is only the beginning. You can do a lot more by combining it with aggregate, window, string/text, and date functions, which I'm going to cover in the next post. Stay tuned!