R Programming Notes For IBM Data Analyst Certificate

Installing a package
install.packages("tidyverse")
Loading a package
library(tidyverse)
The installed.packages() function lists the packages currently installed in your R library. The output includes each package's name along with details such as its version and the packages it depends on.
installed.packages()
CRAN is a commonly used online archive of R packages and other R resources. CRAN makes sure that the R resources it shares follow the required quality standards and are authentic and valid.

The dplyr package is a tidyverse package that contains a set of functions, such as select(), that help with data manipulation. For example, select() keeps only the relevant variables based on their names.
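As a small illustration of select(), the sketch below keeps three columns of the built-in mtcars dataset by name. The choice of dataset and columns is only an assumption for the example; any data frame works the same way.

```r
library(dplyr)

# mtcars ships with R; keep only three of its columns, chosen by name
small_cars <- select(mtcars, mpg, cyl, hp)

colnames(small_cars)
```

The rows are untouched; only the set of columns changes.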

Vectors

A vector is a group of data elements of the same type, stored in a sequence in R. An atomic vector cannot contain both logicals and numerics; if you combine them, R converts the values to a single common type.

There are two types of vectors: atomic vectors and lists. There are six primary types of atomic vectors: logical, integer, double, character (which contains strings), complex, and raw.

One way to create a vector is by using the c() function (called the “combine” function). The c() function in R combines multiple values into a vector. In R, this function is just the letter “c” followed by the values you want in your vector inside the parentheses, separated by a comma: c(x, y, z, …)

Every vector you create will have two key properties: type and length.

You can determine what type of vector you are working with by using the typeof() function. Place the code for the vector inside the parentheses of the function. When you run the function, R will tell you the type.

You can determine the length of an existing vector–meaning the number of elements it contains–by using the length() function.

You can also check if a vector is a specific type by using an is function: is.logical(), is.double(), is.integer(), is.character().

All types of vectors can be named. Names are useful for writing readable code and describing objects in R. You can name the elements of a vector with the names() function.

Code summary for vectors

x <- c(1, 3, 5) ### create a vector and assign it to x
names(x) <- c("a", "b", "c") ### name the elements
is.character(x) ### check whether x is a character vector
typeof(x) ### check the vector type
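The two key properties, type and length, can be checked on any vector. The sketch below also shows what happens when logicals and numerics are combined: R coerces them to one type. The variable names are just placeholders for the example.

```r
x <- c(a = 1, b = 3, c = 5)   # names can also be supplied inside c()

n <- length(x)          # number of elements
kind <- typeof(x)       # storage type of the vector
second <- x["b"]        # look up an element by its name

# Mixing a logical with a numeric does not give a mixed vector;
# the logical TRUE is converted to the number 1
mixed <- c(TRUE, 2)
mixed_kind <- typeof(mixed)
```

Here `kind` and `mixed_kind` are both "double", because numbers written without an `L` suffix are stored as doubles.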

Lists

Lists are different from atomic vectors because their elements can be of any type, such as dates, data frames, vectors, matrices, and more. Lists can even contain other lists.

You can create a list with the list() function. Similar to the c() function, the list() function is just list followed by the values you want in your list inside parentheses.

If you want to find out what types of elements a list contains, you can use the str() function.

Code

list("a", 1L, 1.5, TRUE)
str(list("a", 1L, 1.5, TRUE))
z <- list(list(list(1, 3, 5)))
str(z)
### Naming Lists
list("Chicago" = 1, "New York" = 2, "Los Angeles" = 3)

Date and Time

In R, there are three types of data that refer to an instant in time:

A date ("2016-08-16")
A time within a day ("20:11:59 UTC")
A date-time, which is a date plus a time ("2018-03-31 18:15:48 UTC")

Converting from strings

Date/time data often comes as strings. You can convert strings into dates and date-times using the tools provided by the lubridate package. Once you tell them the order of the components, these tools work out the rest of the date/time format automatically. First, identify the order in which the year, month, and day appear in your dates. Then, arrange the letters y, m, and d in the same order. That gives you the name of the lubridate function that will parse your date. For example, for the date 2021-01-20, you use the order ymd:

Code

library(lubridate)
ymd("2021-01-20")
mdy("January 20th, 2021")
dmy("20-Jan-2021")
ymd(20210120)

### output for all is below
#> [1] "2021-01-20"

Creating date-time components

The ymd() function and its variations create dates. To create a date-time, add an underscore and one or more of the letters h, m, and s (hours, minutes, seconds) to the name of the function.

Code

ymd_hms("2021-01-20 20:11:59")
### #> [1] "2021-01-20 20:11:59 UTC"
mdy_hm("01/20/2021 08:01")
### #> [1] "2021-01-20 08:01:00 UTC"

Switching between existing date-time objects

You can use the function as_date() to convert a date-time to a date. For example, put the current date-time, now(), in the parentheses of the function.

Code

as_date(now())
### prints the current date; on 2021-01-20 the output was:
#> [1] "2021-01-20"

Data frames

A data frame is a collection of columns–similar to a spreadsheet or SQL table. Each column has a name at the top that represents a variable, and includes one observation per row. Data frames help summarize data and organize it into a format that is easy to read and use.

If you need to manually create a data frame in R, you can use the data.frame() function. The data.frame() function takes vectors as input. In the parentheses, enter the name of the column, followed by an equals sign, and then the vector you want to input for that column.

The mutate() function from dplyr can be used to make changes to a data frame, such as adding a new column or modifying an existing one.

Code

### The _x_ column is a vector with elements 1, 2, 3, and the _y_ column is a vector with elements 1.5, 5.5, 7.5
data.frame(x = c(1, 2, 3) , y = c(1.5, 5.5, 7.5))
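To connect data.frame() with mutate(), the sketch below builds the same two-column data frame and then adds a third column. The column name `total` is made up for the example.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3), y = c(1.5, 5.5, 7.5))

# mutate() returns a new data frame; df itself is not modified
df_with_total <- mutate(df, total = x + y)

df_with_total
```

Because mutate() does not change its input in place, the result has to be assigned to a name if you want to keep it.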

Files

Use the dir.create() function to create a new folder, or directory, to hold your files. Place the name of the folder in the parentheses of the function.

Use the file.create() function to create a blank file. Place the name and the type of the file in the parentheses of the function. Your file types will usually be something like .txt, .docx, or .csv.

Copying a file can be done using the file.copy() function. In the parentheses, add the name of the file to be copied. Then, type a comma, and add the name of the destination folder that you want to copy the file to.

You can delete R files using the unlink() function. Enter the file’s name in the parentheses of the function.

Code

dir.create("destination_folder")
file.create("new_text_file.txt")
file.create("new_word_file.docx")
file.create("new_csv_file.csv")
file.copy("new_text_file.txt", "destination_folder")
unlink("some_.file.csv")
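The same four functions can be tried safely inside R's temporary directory, so nothing in your project folder is touched. This is only a sketch; the folder name `files_demo` is a hypothetical placeholder.

```r
# Work in a throwaway location under the session's temporary directory
base <- file.path(tempdir(), "files_demo")     # hypothetical folder name
dir.create(base)

src  <- file.path(base, "new_text_file.txt")
dest <- file.path(base, "destination_folder")
dir.create(dest)
file.create(src)

# When the target is an existing folder, the file is copied into it
copied  <- file.copy(src, dest)
copy_ok <- file.exists(file.path(dest, "new_text_file.txt"))

unlink(src)                       # delete the original file
gone <- !file.exists(src)

unlink(base, recursive = TRUE)    # folders need recursive = TRUE
```

file.create() and file.copy() return TRUE or FALSE, which makes it easy to check whether each step worked.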

Matrices

A matrix is a two-dimensional collection of data elements. This means it has both rows and columns. By contrast, a vector is a one-dimensional sequence of data elements. But like vectors, matrices can only contain a single data type. For example, you can’t have both logicals and numerics in a matrix.

To create a matrix in R, you can use the matrix() function. The matrix() function has two main arguments that you enter in the parentheses. First, add a vector. The vector contains the values you want to place in the matrix. Next, add at least one matrix dimension. You can choose to specify the number of rows or the number of columns by using the code nrow = or ncol =.

For example, imagine you want to create a 2x3 (two rows by three columns) matrix containing the values 3 through 8. First, enter a vector containing that series of numbers: c(3:8). Then, enter a comma. Finally, enter nrow = 2 to specify the number of rows.

You can also choose to specify the number of columns (ncol = ) instead of the number of rows (nrow = ).

Code

matrix(c(3:8), nrow = 2)
matrix(c(3:8), ncol = 2)
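To see how the two calls differ, the sketch below checks the dimensions of each result. By default matrix() fills the values column by column, which is why the first row is not 3, 4, 5.

```r
m_rows <- matrix(3:8, nrow = 2)   # 2 rows, so R works out 3 columns
m_cols <- matrix(3:8, ncol = 2)   # 2 columns, so R works out 3 rows

dim(m_rows)    # rows, then columns
dim(m_cols)

first_row <- m_rows[1, ]   # values are filled down each column in turn
```

If you want the values to run across rows instead, matrix() accepts byrow = TRUE.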

Logical operators and conditional statements

Logical operators return a logical data type such as TRUE or FALSE.
There are three primary types of logical operators:
● AND (sometimes represented as & or && in R)
● OR (sometimes represented as | or || in R)
● NOT (!)
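A short sketch of the three operators applied to a single number; the value 10 is arbitrary.

```r
x <- 10

both   <- x > 3 & x < 12   # AND: TRUE only when both comparisons hold
either <- x < 3 | x > 8    # OR: TRUE when at least one comparison holds
negate <- !(x > 3)         # NOT: flips TRUE to FALSE

c(both, either, negate)
```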

Let's discuss how to create conditional statements in R using three related statements:
● if
● else
● else if
The if statement sets a condition, and if the condition evaluates to TRUE, the R code associated with the if statement is executed.
if (x > 0) {
  print("x is a positive number")
}
The else statement is used in combination with an if statement. This is how the code is structured in R:

Code
x <- 7
### else must start on the same line as the closing brace of the if block
if (x > 0) {
  print("x is a positive number")
} else {
  print("x is either a negative number or zero")
}
In some cases, you might want to customize your conditional statement even further by adding the else if statement. The else if statement comes in between the if statement and the else statement.

Code
x <- -1
if (x < 0) {
  print("x is a negative number")
} else if (x == 0) {
  print("x is zero")
} else {
  print("x is a positive number")
}
The main difference between the element-wise logical operators (&, |) and the scalar logical operators (&&, ||) is how they handle vectors. The single-sign operators, AND (&) and OR (|), examine all the elements of each vector and return a vector of results. The double-sign operators, AND (&&) and OR (||), expect single values and return a single TRUE or FALSE. Older versions of R silently examined only the first element of each vector; since R 4.3.0, passing a vector with more than one element to && or || is an error.
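The difference is easiest to see with two short logical vectors. The sketch below uses the single-sign operators on whole vectors and the double-sign operator on single elements only, which works in every R version.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

elementwise_and <- a & b   # compares position by position
elementwise_or  <- a | b

# The double-sign forms are intended for single values, such as an if() condition
scalar_and <- a[1] && b[1]
```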

A pipe is a tool for expressing a sequence of multiple operations in R (in this case filtering and grouping). The operator for a pipe is %>%.

Code

library(dplyr)
mtcars %>%
  filter(carb > 1) %>%
  group_by(cyl)
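A pipeline like this usually ends with a step that uses the groups. The sketch below adds a summarize() call as one possible final step; that last step and the column name `mean_mpg` are assumptions for illustration, not part of the original notes.

```r
library(dplyr)

result <- mtcars %>%
  filter(carb > 1) %>%               # keep cars with more than one carburetor
  group_by(cyl) %>%                  # one group per cylinder count
  summarize(mean_mpg = mean(mpg))    # assumed final step: average mpg per group

result
```

Each `%>%` passes the result on its left as the first argument of the function on its right, so the code reads top to bottom in the order the steps run.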

Tibbles

Tibbles are like streamlined data frames that are automatically set to pull up only the first 10 rows of a dataset, and only as many columns as can fit on the screen. Overall, you can make more changes to data frames, but tibbles are easier to use.

Code

### loading tidyverse
library(tidyverse)
### loading diamonds dataset
data(diamonds)
### view the dataset
View(diamonds)
### create the tibble from the dataset
as_tibble(diamonds)

Data import

R comes with built-in datasets that are useful for practice. You can use the data() function to load these datasets in R. If you run the data() function without an argument, R will display a list of the available datasets.
If you want to load a specific dataset, just enter its name in the parentheses of the data() function.
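For example, the built-in mtcars dataset can be loaded by name and previewed. Which dataset to load is an arbitrary choice here.

```r
data("mtcars")             # load a built-in dataset by name

preview <- head(mtcars, 3) # first three rows
n_cars <- nrow(mtcars)     # number of rows in the full dataset

preview
```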

readr

The readr package is part of the core tidyverse. In addition to using R's built-in datasets, it is also helpful to import data from other sources to use for practice or analysis. The readr package in R is a great tool for reading rectangular data. Rectangular data is data that fits nicely inside a rectangle of rows and columns, with each column referring to a single variable and each row referring to a single observation.

The goal of readr is to provide a fast and friendly way to read rectangular data. readr supports several read_ functions. Each function refers to a specific file format.

read_csv(): comma separated (CSV) files
read_tsv(): tab separated files
read_delim(): general delimited files
read_fwf(): fixed width files
read_table(): tabular files where columns are separated by white-space
read_log(): web log files

Code

library(readr)
### To list the sample files, you can run the readr_example() function with no arguments
readr_example()
### When you run the function, R prints out a column specification that gives the name and type of each column
read_csv(readr_example("mtcars.csv"))
### To read your own file, pass its path instead (this assumes mtcars.csv is in the working directory)
read_csv("mtcars.csv")
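If you do not have a CSV file at hand, a round trip through a temporary file shows the whole cycle. This is a sketch with made-up values; write_csv() is the readr counterpart for writing, and the `show_col_types` argument assumes a reasonably recent readr version.

```r
library(readr)

# Write a tiny, made-up table to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:3, score = c(90.5, 82, 77.25)), tmp)

# Read it back; show_col_types = FALSE silences the column specification message
scores <- read_csv(tmp, show_col_types = FALSE)

scores
```

read_csv() returns a tibble, so printing `scores` shows the column types under the column names.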

readxl

To import spreadsheet data into R, you can use the readxl package. The readxl package makes it easy to transfer data from Excel into R. It supports both the legacy .xls file format and the modern XML-based .xlsx file format.

Code

library(readxl)
readxl_example()
read_excel(readxl_example("type-me.xlsx"))
### You can use the excel_sheets() function to list the names of the individual sheets
excel_sheets(readxl_example("type-me.xlsx"))
### You can also specify a sheet by name or number. Just type "sheet =" followed by the name or number of the sheet. For example, you can use the sheet named "numeric_coercion" from the list above.
read_excel(readxl_example("type-me.xlsx"), sheet = "numeric_coercion")
### When you run the function, R returns a tibble of the sheet

Operators

In R, there are four main types of operators:

Arithmetic
Relational
Logical
Assignment
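One example of each operator type, using arbitrary numbers:

```r
total <- 7 + 3                      # assignment (<-) and arithmetic (+)
remainder <- total %% 4             # arithmetic: remainder after division
is_big <- total >= 10               # relational: compares two values
in_range <- total > 5 & total < 8   # logical: combines two comparisons
```

Relational and logical operators both return TRUE or FALSE, which is why they are often used together in conditions.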

Tidy data

Data can be organized in a wide format or a long format, and there are compelling reasons to use both. But as an analyst, it is important to know how to tidy data when you need to. In R, you may have a data frame in a wide format that has several variables and conditions for each variable. It might feel a bit messy.

That's where pivot_longer() comes in. This function is part of the tidyr package, and you can use it to lengthen the data in a data frame by increasing the number of rows and decreasing the number of columns. Similarly, if you want to convert your data to have more columns and fewer rows, you would use the pivot_wider() function.
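A minimal sketch of both directions, using a made-up table of sales per store and quarter. All names and numbers here are hypothetical.

```r
library(tidyr)

# Wide format: one row per store, one column per quarter
wide <- data.frame(
  store = c("A", "B"),
  q1 = c(10, 20),
  q2 = c(15, 25)
)

# Longer: the quarter columns become rows
long <- pivot_longer(wide, cols = c(q1, q2),
                     names_to = "quarter", values_to = "sales")

# Wider: spread the quarters back out into columns
back <- pivot_wider(long, names_from = quarter, values_from = sales)
```

`long` has four rows (two stores times two quarters) and three columns, while `back` has the same shape as `wide`.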

Visualizing data with ggplot2

The ggplot2 package lets you make high quality, customizable plots of your data. As a refresher, ggplot2 is based on the grammar of graphics, which is a system for describing and building data visualizations. The essential idea behind the grammar of graphics is that you can build any plot from the same basic components, like building blocks.

These building blocks include:

A dataset
A set of geoms: A geom refers to the geometric object used to represent your data. For example, you can use points to create a scatterplot, bars to create a bar chart, lines to create a line diagram, etc.
A set of aesthetic attributes: An aesthetic is a visual property of an object in your plot. You can think of an aesthetic as a connection, or mapping, between a visual feature in your plot and a variable in your data. For example, in a scatterplot, aesthetics include things like the size, shape, color, or location (x-axis, y-axis) of your data points.

To create a plot with ggplot2, you first choose a dataset. Then, you determine how to visually organize your data on a coordinate system by choosing a geom to represent your data points and aesthetics to map your variables.

Code

install.packages("ggplot2")
install.packages("dplyr")
### Install dataset
install.packages("palmerpenguins")
library(ggplot2)
library(dplyr)
### load the dataset
library(palmerpenguins)
data(penguins)
### View the dataset
View(penguins)
### ggplot(data = penguins): In ggplot2, you begin a plot with the ggplot() function. The ggplot() function creates a coordinate system that you can add layers to. The first argument of the ggplot() function is the dataset to use in the plot. In this case, it's "penguins."
### Then, you add a "+" symbol to add a new layer to your plot. You complete your plot by adding one or more layers to ggplot().
### geom_point(): Next, you choose a geom by adding a geom function. The geom_point() function uses points to create scatterplots, the geom_bar() function uses bars to create bar charts, and so on. In this case, choose the geom_point() function to create a scatter plot of points. The ggplot2 package comes with many different geom functions.
### (mapping = aes(x = flipper_length_mm, y = body_mass_g)): Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with the aes() function. The x and y arguments of the aes() function specify which variables to map to the x-axis and the y-axis of the coordinate system. In this case, you want to map the variable "flipper_length_mm" to the x-axis, and the variable "body_mass_g" to the y-axis.
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
### or also specify additional aesthetic attributes; the mapped variables must be columns of penguins, such as species and bill_length_mm
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species, size = bill_length_mm, shape = species)) + geom_point()
### Smoothing enables the detection of a data trend even when you can't easily notice a trend from the plotted data points. The ggplot2 smoothing functionality is helpful because it adds a smoothing line as another layer to a plot; the smoothing line helps the data make sense to a casual observer.
### GAM smoothing is useful for smoothing plots with a large number of points.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + geom_point() + geom_smooth(method = "gam", formula = y ~ s(x))
### Loess smoothing is best for smoothing plots with fewer than 1,000 points.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + geom_point() + geom_smooth(method = "loess")
### geom_jitter()
### The analyst could use the geom_jitter() function to make the points easier to find. The geom_jitter() function adds a small amount of random noise to each point in the plot, which helps deal with the overlapping of points.
### The facet_wrap(~ variable_name) function lets you display smaller groups, or subsets, of your data.
### Use labs() to create a title for your visualization and annotate() to add notes to your plot.
### Use ggsave("filename.jpg") to save your plot.
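The last few functions can be combined in one plot. The sketch below uses the mpg dataset that ships with ggplot2 as a stand-in, so it runs without installing palmerpenguins; the title, annotation text, jitter amounts, and output file name are all made up for the example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_jitter(width = 0.1, height = 0.1) +          # nudge overlapping points apart
  facet_wrap(~drv) +                                # one panel per drive type
  labs(title = "Engine size vs. highway mileage") + # plot title
  annotate("text", x = 5.5, y = 40,
           label = "one panel per drive type")      # note placed at data coordinates

# Save to a temporary folder; passing plot = p avoids relying on the last plot shown
out_file <- file.path(tempdir(), "mpg_plot.jpg")
ggsave(out_file, plot = p, width = 6, height = 4)
```

ggsave() picks the image format from the file extension, so changing `.jpg` to `.png` or `.pdf` changes the output type.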

Documentation and reports

R Markdown is a useful tool that allows you to save and execute code, and generate shareable reports for stakeholders.
R Markdown is a file format for making dynamic documents with R. These documents, also known as notebooks, are records of analysis that help you, your team members, and stakeholders understand what you did in your analysis to reach your conclusions. You can publish a notebook as an HTML, PDF, or Word file, or in another format like a slideshow.

Functions

– arrange ()

The dplyr function arrange() can be used to reorder (or sort) rows by one or more variables.

Reorder rows by Sepal.Length in ascending order
Reorder rows by Sepal.Length in descending order. Use the function desc():
Reorder rows by multiple variables: Sepal.Length and Sepal.Width

Code

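library(dplyr)
my_data <- iris
my_data %>% arrange(Sepal.Length) # ascending order
my_data %>% arrange(desc(Sepal.Length)) # descending order
arrange(my_data, -Sepal.Length) # descending order, alternative form
my_data %>% arrange(Sepal.Length, Sepal.Width) # multiple variables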

– as_data_frame()

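Convert loaded data into a tibble. Note that as_data_frame() is deprecated in current versions of the tibble package; as_tibble() is its replacement.

Code

# Create my_data
my_data <- iris
# Convert to a tibble
library("tibble")
my_data <- as_tibble(my_data) # replaces the deprecated as_data_frame()
# Print
my_data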

– data()

To list available datasets

– unite()
The unite() function from the tidyr package can be used to combine columns.
– clean_names()
The clean_names() function from the janitor package automatically makes sure that column names are unique and consistent.
– colnames(dataset or dataframe)

Get a list of the column names

– skim_without_charts(dataset) or glimpse () or summary ()

Get a comprehensive view and information about the dataset.

– filter ()
The filter function allows the data analyst to specify which part of the data they want to view

Code

Question 5

A data analyst is working with the penguins data. They write the following code:
penguins %>%
The variable _species_ includes three penguin species: Adelie, Chinstrap, and Gentoo. What code chunk does the analyst add to create a data frame that only includes the Gentoo species?
filter(species == "Gentoo")
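The same pattern can be tried on the built-in iris dataset, which needs no extra packages beyond dplyr. Using iris here is just a stand-in for the penguins data.

```r
library(dplyr)

# Keep only the rows where Species equals "setosa"
setosa_only <- iris %>%
  filter(Species == "setosa")

n_setosa <- nrow(setosa_only)
```

Note the double equals sign: `==` tests for equality, while a single `=` would be treated as an argument name.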

– mutate ()

Create new columns in a data frame or modify existing ones.

Code

Question 7
A data analyst is working with a data frame called _salary_data_. They want to create a new column named _total_wages_ that adds together data in the _standard_wages_ and _overtime_wages_ columns. What code chunk lets the analyst create the _total_wages_ column?
mutate(salary_data, total_wages = standard_wages + overtime_wages)
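To see the answer run, the sketch below builds a tiny salary_data data frame first. The wage values are made up; only the column names come from the question.

```r
library(dplyr)

# Hypothetical rows, just enough to run the code
salary_data <- data.frame(
  standard_wages = c(1000, 1200),
  overtime_wages = c(150, 0)
)

salary_data <- mutate(salary_data,
                      total_wages = standard_wages + overtime_wages)

salary_data
```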

– bias()

The bias() function, which comes from the SimDesign package, can be used to calculate the average amount by which a predicted outcome and an actual outcome differ, in order to determine whether the data model is biased.
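The idea behind this calculation can be sketched by hand in base R as the mean of the differences between actual and predicted values. This is an illustration of the concept only, not the bias() function itself; the numbers are made up, and the direction of the subtraction is an assumption.

```r
# Hypothetical actual and predicted values
actual    <- c(10, 12, 14)
predicted <- c(9, 12, 13)

# Average difference; a result close to zero suggests little systematic bias
mean_difference <- mean(actual - predicted)

mean_difference
```

A positive result here means the predictions run low on average, and a negative one means they run high.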

Case Study

As part of the data science team at Gourmet Analytics, you use data analytics to advise companies in the food industry. You clean, organize, and visualize data to arrive at insights that will benefit your clients. As a member of a collaborative team, sharing your analysis with others is an important part of your job.

Your current client is Chocolate and Tea, an up-and-coming chain of cafes.

The eatery combines an extensive menu of fine teas with chocolate bars from around the world. Their diverse selection includes everything from plantain milk chocolate, to tangerine white chocolate, to dark chocolate with pistachio and fig. The encyclopedic list of chocolate bars is the basis of Chocolate and Tea’s brand appeal. Chocolate bar sales are the main driver of revenue.

Chocolate and Tea aims to serve chocolate bars that are highly rated by professional critics. They also continually adjust the menu to make sure it reflects the global diversity of chocolate production. The management team regularly updates the chocolate bar list in order to align with the latest ratings and to ensure that the list contains bars from a variety of countries.

They’ve asked you to collect and analyze data on the latest chocolate ratings. In particular, they’d like to know which countries produce the highest-rated bars of super dark chocolate (a high percentage of cocoa). This data will help them create their next chocolate bar menu.

Code

library(tidyverse)
### Before you begin working with your data, you need to import it and save it as a data frame. To get started, you open your RStudio workspace and load the tidyverse library. You upload a .csv file containing the data to RStudio and store it in your project folder as flavors_of_cacao.csv.
### You use the read_csv() function to import the data from the .csv file. Assume that the name of the data frame is bars_df and the .csv file is in the working directory. What code chunk lets you create the data frame?
bars_df <- read_csv("flavors_of_cacao.csv")
### Now that you've created a data frame, you want to find out more about how the data is organized. The data frame has hundreds of rows and lots of columns.
### Assume the name of your data frame is flavors_df. What code chunk lets you review the column names in the data frame?
colnames(flavors_df)
### Next, you begin to clean your data. When you check out the column headings in your data frame you notice that the first column is named _Company...Maker.if.known._ (Note: The period after _known_ is part of the variable name.) For the sake of clarity and consistency, you decide to rename this column _Company_ (without a period at the end).
### rename() takes new_name = old_name, so the new name goes on the left
flavors_df %>% rename(Company = Company...Maker.if.known.)
### After previewing and cleaning your data, you determine what variables are most relevant to your analysis. Your main focus is on _Rating_, _Cocoa.Percent_, and _Company_. You decide to use the select() function to create a new data frame with only these three variables. Add the code chunk that lets you select the three variables.
trimmed_flavors_df <- flavors_df %>%
  select(Rating, Cocoa.Percent, Company)
### Next, you select the basic statistics that can help your team better understand the ratings system in your data.
### Assume the first part of your code is:
### trimmed_flavors_df %>%
### You want to use the summarize() and max() functions to find the maximum rating for your data. Add the code chunk that lets you find the maximum value for the variable _Rating_.
trimmed_flavors_df %>%
  summarize(max(Rating))
### After completing your analysis of the rating system, you determine that any rating greater than or equal to 3.9 points can be considered a high rating. You also know that Chocolate and Tea considers a bar to be super dark chocolate if the bar's cocoa percent is greater than or equal to 75%. You decide to create a new data frame to find out which chocolate bars meet these two conditions.
### Assume the first part of your code is:
### best_trimmed_flavors_df <- trimmed_flavors_df %>%
### You want to apply the filter() function to the variables _Cocoa.Percent_ and _Rating_. Add the code chunk that lets you filter the data frame for chocolate bars that contain at least 75% cocoa and have a rating of at least 3.9 points.
### Note: '75%' is text, so this is a string comparison and relies on Cocoa.Percent being stored as text such as "75%"
best_trimmed_flavors_df <- trimmed_flavors_df %>%
  filter(Cocoa.Percent >= '75%' & Rating >= 3.9)
### Now that you've cleaned and organized your data, you're ready to create some useful data visualizations. Your team assigns you the task of creating a series of visualizations based on requests from the Chocolate and Tea management team. You decide to use ggplot2 to create your visuals.
### Assume your first line of code is:
### ggplot(data = best_trimmed_flavors_df) +
### You want to use the geom_bar() function to create a bar chart. Add the code chunk that lets you create a bar chart with the variable _Rating_ on the x-axis.
ggplot(data = best_trimmed_flavors_df) +
  geom_bar(mapping = aes(x = Rating))
### Your bar chart reveals the locations that produce the highest rated chocolate bars. To get a better idea of the specific rating for each location, you'd like to highlight each bar.
### Assume that you are working with the following code:
### ggplot(data = best_trimmed_flavors_df) +
### geom_bar(mapping = aes(x = Company.Location))
### Add a code chunk to the second line of code to map the aesthetic _fill_ to the variable _Rating_.
### NOTE: the three dots (…) indicate where to add the code chunk.
ggplot(data = best_trimmed_flavors_df) +
  geom_bar(mapping = aes(x = Company.Location, fill = Rating))
### A teammate creates a new plot based on the chocolate bar data. The teammate asks you to make some revisions to their code.
### Assume your teammate shares the following code chunk:
### ggplot(data = best_trimmed_flavors_df) +
### geom_bar(mapping = aes(x = Company)) +
### What code chunk do you add to the third line to create wrap around facets of the variable _Company_?
ggplot(data = best_trimmed_flavors_df) +
  geom_bar(mapping = aes(x = Company)) +
  facet_wrap(~Company)
### Your team has created some basic visualizations to explore different aspects of the chocolate bar data. You've volunteered to add titles to the plots. You begin with a scatterplot.
### Assume the first part of your code chunk is:
### ggplot(data = trimmed_flavors_df) + geom_point(mapping = aes(x = Cocoa.Percent, y = Rating)) +
### What code chunk do you add to the third line to add the title _Suggested Chocolate_ to your plot?
ggplot(data = trimmed_flavors_df) +
  geom_point(mapping = aes(x = Cocoa.Percent, y = Rating)) +
  labs(title = "Suggested Chocolate")
### Next, you create a new scatterplot to explore the relationship between different variables. You want to save your plot so you can access it later on. You know that the ggsave() function defaults to saving the last plot that you displayed in RStudio, so you're ready to write the code to save your scatterplot.
### Assume your first two lines of code are:
### ggplot(data = trimmed_flavors_df) + geom_point(mapping = aes(x = Cocoa.Percent, y = Rating))
### What code chunk do you add to the third line to save your plot as a jpeg file with _chocolate_ as the file name?
ggplot(data = trimmed_flavors_df) +
  geom_point(mapping = aes(x = Cocoa.Percent, y = Rating))
ggsave("chocolate.jpeg")
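The filtering step compares _Cocoa.Percent_ as text, and text comparison does not always agree with numeric order (for example, "100%" sorts before "75%"). One way around this, sketched below, is to convert the column to a number first with parse_number() from readr. The four rows are hypothetical stand-ins for the real file, and the column name `cocoa_num` is made up for the example.

```r
library(dplyr)
library(readr)

# Hypothetical stand-in rows; the real data comes from flavors_of_cacao.csv
flavors_df <- tibble(
  Company = c("Maker A", "Maker B", "Maker C", "Maker D"),
  Cocoa.Percent = c("70%", "75%", "100%", "80%"),
  Rating = c(4.0, 3.9, 4.0, 3.5)
)

# As text, "100%" is not >= "75%", because "1" sorts before "7"
text_check <- "100%" >= "75%"

best_df <- flavors_df %>%
  mutate(cocoa_num = parse_number(Cocoa.Percent)) %>%  # "75%" becomes 75
  filter(cocoa_num >= 75, Rating >= 3.9)

best_df$Company
```

With the numeric column, the 100% bar is kept; with the text comparison it would have been dropped.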


About the Author

Mastermind Study Notes is a group of talented authors and writers who are experienced and well-versed across different fields. The group is led by Motasem Hamdan, who is a Cybersecurity content creator and YouTuber.
