Today I finished the Google Data Analytics Professional Certificate offered by Google and Coursera, and I would like to share my review and course notes.

There are eight courses you have to finish before earning the final certificate.

Skills you will gain

  • Gain an immersive understanding of the practices and processes used by a junior or associate data analyst in their day-to-day job

  • Learn key analytical skills (data cleaning, analysis, & visualization) and tools (spreadsheets, SQL, R programming, Tableau)

  • Understand how to clean and organize data for analysis, and complete analysis and calculations using spreadsheets, SQL and R programming

  • Learn how to visualize and present data findings in dashboards, presentations and commonly used visualization platforms

     

Buy The Complete Data Analytics Notes Catalog

 

The final certificate

Google Data Analytics Professional Certificate

I've also added my notes and a summary for each course, including code snippets, concepts, and other material you may need to pass the course. Summarize what you have learned and keep the notes; they may come in handy when you need them.

You can download my collection of summaries and notes from the links below:

R Programming 

Excel Analytics

Data Visualizations

SQL Programming

Video Review

Summary of the concepts that you will learn:

# The six steps in the data analysis process
1. Ask questions and define the problem.
2. Prepare data by collecting and storing the information.
3. Process data by cleaning and checking the information.
4. Analyze data to find patterns, relationships, and trends.
5. Share data with your audience.
6. Act on the data and use the analysis results.

# Data Ecosystem
The various elements that interact with one another to produce, manage, store, organize, analyze, and share data.

# A technical mindset
The analytical skill that involves breaking processes down into smaller steps and working with them in an orderly, logical way

# Data design
Analytical skills that involve how you organize information

# Data science
A field of study that uses raw data to create new ways of modeling and understanding the unknown

# Data strategy
The management of the people, processes, and tools used in data analysis

# Gap analysis
A method for examining and evaluating the current state of a process in order to identify opportunities for improvement in the future

# Query language
A computer programming language used to communicate with a database
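SQL is the query language used throughout the course. A minimal sketch against a hypothetical employees table:

```sql
-- SELECT picks columns, FROM names the table, WHERE filters rows.
SELECT name, department
FROM employees
WHERE department = 'Analytics';
```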

# Data Life Cycle vs Data Analysis
The data life cycle deals with the stages that data goes through during its useful life; data analysis is the process of analyzing data.

# Formula vs Function
A formula is a set of instructions used to perform a specified calculation and is created by the user; a function is a preset command in a spreadsheet that automatically performs a specified process.

# The six problem types a data analyst works with:
Making predictions
Categorizing things
###### A data analyst identifying keywords from customer reviews and labeling them as positive or neutral is an example of categorizing things.
Spotting something unusual
###### The problem type of spotting something unusual could involve a data analyst examining why a dataset has a surprising and rare data point. Spotting something unusual deals with identifying and analyzing something out of the ordinary.
Identifying themes
###### User experience (UX) designers might rely on analysts to analyze user interaction data. Similar to problems that require analysts to categorize things, usability improvement projects might require analysts to identify themes to help prioritize the right product features for improvement. Themes are most often used to help researchers explore certain aspects of data. In a user study, user beliefs, practices, and needs are examples of themes.
###### By now you might be wondering if there is a difference between categorizing things and identifying themes. The best way to think about it is this: categorizing things generally classifies the same things together, like a product score of 10, while identifying themes classifies similar things which may not be the same, like positive user feedback; each user says something different, but they are all communicating positive things about the product, which becomes a theme.
Discovering connections
Finding patterns
###### Finding patterns deals with identifying trends in a data set.

# SMART questions are:
-Specific: does the question have context and address the problem? Do the answers help you collect information about just one specific element or closely related ones?
-Measurable: the answers can be measured and collected so they can be classified and rated to see which are most and least important.
-Action-oriented: when answered, the question helps you make decisions that focus on solving a specific problem or inventing a new feature.
-Relevant: is it about the problem at hand?
-Time-bound: will the answers solve the problem sooner rather than later? Can a plan be created to implement the solutions that buyers prefer and cut back on the least important features?

# Structured thinking
The process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying the options.

# Categorizing things involves assigning items to categories. Identifying themes takes those categories a step further, grouping them into broader themes or classifications.

# Qualitative vs Quantitative Data
Qualitative data can help analysts better understand their quantitative data by providing a reason or more thorough explanation. In other words, quantitative data generally gives you the what, and qualitative data generally gives you the why

# Dashboards monitor live, incoming data from multiple datasets and organize the information into one central location.

# Data vs metrics
Data is a collection of facts. A metric is a single, quantifiable type of data used for measurement, such as when setting and evaluating goals.

# Algorithm
A process or set of rules to be followed for a specific task.

# The four questions for an effective communication strategy (mainly used in emails)
Who is your audience?
What do they already know?
What do they need to know?
How can you best communicate what they need to know?

# First Party Data
Data that you collect yourself

# Second Party Data
The data that is collected directly by another group and then sold.

# Third Party Data
Third-party data is sold by a provider that didn’t collect the data themselves; it might come from a number of different sources.

# If you are collecting your own data, make reasonable decisions about sample size

# A random sample from existing data might be fine for some projects

# Observation is the method of data collection most often used by scientists.

# Primary Data
Collected by a researcher from first-hand sources
ex: Data from an interview you conducted

# Secondary data
Gathered by other people or from other research
Demographic data collected by a university

# Continuous data
Data that is measured and can have almost any numeric value
Height of kids in third grade classes (52.5 inches, 65.7 inches)

# Discrete data
Data that is counted and has a limited number of values
Number of people who visit a hospital on a daily basis (10, 20, 200)

# Nominal Data
A type of qualitative data that isn’t categorized with a set order
First time customer, returning customer, regular customer

# Ordinal Data
A type of qualitative data with a set order or scale
Movie ratings (number of stars: 1 star, 2 stars, 3 stars)

# Structured data
Data organized in a certain format, like rows and columns
Expense reports

# Unstructured data
Data that isn’t organized in any easily identifiable manner
Social media posts

# Data modeling is the process of creating diagrams that visually represent how data is organized and structured.
These visual representations are called data models.
# Data Modeling types
###### Conceptual data modeling gives you a high-level view of your data structure, such as how you want data to interact across an organization.
###### Logical data modeling focuses on the technical details of the model such as relationships, attributes, and entities
###### Physical data modeling depicts how the database is actually built. By this stage, you are laying out how each database will be put in place and how the databases, applications, and features will interact in specific detail.
# Data modeling techniques
ERDs are a visual way to understand the relationship between entities in the data model
UML diagrams are very detailed diagrams that describe the structure of a system by showing the system’s entities, attributes, operations, and relationships
# Data transformation is the process of changing the data’s format, structure, or values
# Long data is data in which each row is a data point for an individual subject. Each subject has data in multiple rows.
# Wide data is data in which each data subject has a single row with multiple columns for the values of the various attributes (or variables) of the subject
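A minimal SQL sketch of the difference, using a hypothetical sales_wide table with columns store_id, q1_sales, and q2_sales: the UNION ALL below reshapes it into long format, one row per store per quarter.

```sql
-- Wide: one row per store, one column per quarter.
-- Long: one row per store per quarter, produced by stacking the columns.
SELECT store_id, 'Q1' AS quarter, q1_sales AS sales FROM sales_wide
UNION ALL
SELECT store_id, 'Q2' AS quarter, q2_sales AS sales FROM sales_wide;
```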
# The Boolean operator OR expands the number of results when used in a keyword search
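For instance, with a hypothetical movies table, OR returns rows matching either condition, so the result set grows:

```sql
-- OR widens the match: a row qualifies if either condition is true.
SELECT title
FROM movies
WHERE genre = 'action' OR genre = 'sci-fi';
```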

# De-identification
A process used to wipe data clean of all personally identifying information
# A relational database is a database that contains a series of tables that can be connected to show relationships.
Basically, they allow data analysts to organize and link data based on what the data has in common.
# Primary key
A unique identifier in a table that references a column where the value of that key in every row is unique.
# Foreign key
A field in one table that is a primary key in another table.
# A table can only have one primary key, but it can have multiple foreign keys.
These keys are what create the relationships between tables in a relational database,
which helps organize and connect data across multiple tables in the database.
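A minimal sketch of the idea with two hypothetical tables: departments has one primary key, and employees references it through a foreign key.

```sql
CREATE TABLE departments (
    dept_id   INT PRIMARY KEY,   -- unique identifier for every row
    dept_name VARCHAR(50)
);

CREATE TABLE employees (
    emp_id  INT PRIMARY KEY,     -- this table's own primary key
    name    VARCHAR(50),
    dept_id INT,                 -- foreign key: departments' primary key
    FOREIGN KEY (dept_id) REFERENCES departments(dept_id)
);
```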
# Normalizing a database is a technique to reduce data redundancy
# A schema is a way of describing how something is organized
# A database schema represents any kind of structure that is applied to the database

# Two commonly used schemas are star schemas and snowflake schemas
# A star schema is simple, isn’t normalized, and has a lot of data redundancy
# A snowflake schema is complex, is normalized, and has very little data redundancy

# Column chart
A column chart is effective at demonstrating the differences between several items in a specific range of values
# Line chart
Line charts are effective for demonstrating trends and patterns, such as how population changes over time.
# Structural metadata indicates exactly how many collections the data lives in.
It provides information about how a piece of data is organized and whether it’s part of one, or more than one, data collection.
# Data governance ensures that a company’s data assets are properly managed.
# The date and time a database was created is an example of administrative metadata.
# Tokenization replaces the data elements you want to protect with randomly generated data referred to as a “token.”
The original data is stored in a separate location and mapped to the tokens.
To access the complete original data, the user or application needs to have permission to use the tokenized data and the token mapping.
This means that even if the tokenized data is hacked, the original data is still safe and secure in a separate location.
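A minimal SQL sketch of the idea, with hypothetical table names: the everyday table stores only tokens, and the mapping lives in a separate, access-controlled table.

```sql
-- The vault (kept in a separate, restricted location).
CREATE TABLE token_vault (
    token       VARCHAR(36) PRIMARY KEY,  -- randomly generated value
    card_number VARCHAR(19) NOT NULL      -- the protected original data
);

-- The working table holds tokens only, so a breach exposes no card numbers.
CREATE TABLE orders (
    order_id   INT PRIMARY KEY,
    card_token VARCHAR(36)
);

-- Recovering originals requires permission to query the vault.
SELECT o.order_id, v.card_number
FROM orders o
JOIN token_vault v ON v.token = o.card_token;
```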

**Data analysts should think about modifying a business objective when the data doesn’t align with the original objective and when there is not enough data to meet the objective**

**Data being used for analysis should align with business objectives and help answer stakeholder questions**

# What to do when you find an issue with your data

## Data issue 1: no data

If there isn’t time to collect data, perform the analysis using proxy data from other datasets. _This is the most common workaround._

If you are analyzing peak travel times for commuters but don’t have the data for a particular city, use the data from another city with a similar size and demographic.

## Data issue 2: too little data

Do the analysis using proxy data along with actual data.

If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors.

Adjust your analysis to align with the data you already have.

If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: _this conclusion applies to adults 25 years and older only_.

## Data issue 3: wrong data, including data with errors

Possible Solutions

If you have the wrong data because requirements were misunderstood, communicate the requirements again.

If you need the data for female voters and received the data for male voters, restate your needs.

Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors.

If your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values.

If you can’t correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias.

If your dataset was translated from a different language and some of the translations don’t make sense, ignore the data with bad translation and go ahead with the analysis of the other data.

![[Data-collection-notes.jpg]]

**Population**

The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.

**Sample**

A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population.

**Margin of error**

Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population.

Margin of error is used to determine how close your sample’s result is to what the result would likely have been if you could have surveyed or tested the entire population. Margin of error helps you understand and interpret survey or test results in real life. Calculating the margin of error is particularly helpful when you are given the data to analyze. After using a calculator to calculate the margin of error, you will know how much the sample results might differ from the results of the entire population.

**Confidence level**

How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study.

In most cases, a 90% or 95% confidence level is used. But, depending on your industry, you might want to set a stricter confidence level. A 99% confidence level is reasonable in some industries, such as the pharmaceutical industry

**Confidence interval**

The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error.
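The course doesn’t derive these, but the standard formulas for a survey proportion tie the three terms together: with z the z-score for the chosen confidence level (1.96 for 95%), \(\hat{p}\) the sample proportion, and n the sample size,

```latex
\text{margin of error} = z \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},
\qquad
\text{confidence interval} = \hat{p} \pm z \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
```

For example, with \(\hat{p} = 0.5\) and n = 1,000, the margin of error at a 95% confidence level is about ±3.1%.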

**Statistical significance**

The determination of whether your result could be due to random chance or not. The greater the significance, the less likely it is that the result is due to random chance.

**In order for an experiment to be statistically significant, the results should be real and not caused by random chance.**

**In order to have a high confidence level in a customer survey, the sample size should accurately reflect the entire population.**

## Types of dirty data

Duplicate data
Outdated data
Incomplete data
Incorrect/inaccurate data
Inconsistent data
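As a sketch of how you might surface the first type, duplicate rows, in SQL (hypothetical customers table): group on the columns that define a duplicate and keep the groups that appear more than once.

```sql
-- Any (name, email) pair appearing in more than one row is a duplicate.
SELECT name, email, COUNT(*) AS copies
FROM customers
GROUP BY name, email
HAVING COUNT(*) > 1;
```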

**A null indicates that a value does not exist. A zero is a numerical response.**
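In SQL this distinction matters: a NULL never matches `= 0`, so the two counts below (on a hypothetical survey_responses table) answer different questions.

```sql
SELECT COUNT(*) FROM survey_responses WHERE score = 0;       -- answered zero
SELECT COUNT(*) FROM survey_responses WHERE score IS NULL;   -- never answered
```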

**Data mapping is the process of matching fields from one data source to another.**

# Documentation

Engineers use **engineering change orders** (ECOs) to keep track of new product design details and proposed changes to existing products. Writers use **document revision histories** to keep track of changes to document flow and edits. And data analysts use **changelogs** to keep track of data transformation and cleaning

Changelogs are super useful for helping us understand the reasons changes have been made. Changelogs have no set format and you can even make your entries in a blank document. But if you are using a shared changelog, it is best to agree with other data analysts on the format of all your log entries

A junior analyst probably only needs to know the above with one exception. If an analyst is making changes to an existing SQL query that is shared across the company, the company most likely uses what is called a **version control system**. An example might be a query that pulls daily revenue to build a dashboard for senior management.

# Version control system

Here is how a version control system affects a change to a query:

1. A company has official versions of important queries in their **version control system**.
2. An analyst makes sure the most up-to-date version of the query is the one they will change. This is called **syncing**.
3. The analyst makes a change to the query.
4. The analyst might ask someone to review this change. This is called a **code review** and can be informally or formally done. An informal review could be as simple as asking a senior analyst to take a look at the change.
5. After a reviewer approves the change, the analyst submits the updated version of the query to a repository in the company’s version control system. This is called a **code commit**. A best practice is to document exactly what the change was and why it was made in a comments area. Going back to our example of a query that pulls daily revenue, a comment might be: _Updated revenue to include revenue coming from the new product, Calypso_.
6. After the change is **submitted**, everyone else in the company will be able to access and use this new query when they **sync** to the most up-to-date queries stored in the version control system.
7. If the query has a problem or business needs change, the analyst can **_undo_** the change to the query using the version control system. The analyst can look at a chronological list of all changes made to the query and who made each change. Then, after finding their own change, the analyst can **revert** back to the previous version.
8. The query is back to what it was before the analyst made the change. And everyone at the company sees this reverted, original query, too.

Without enough data to identify long-term trends, one option is to talk with stakeholders and ask to adjust the objective. You could also ask to wait for more data and provide an updated timeline.

**Outliers** are data points that are very different from similarly collected data and might not be reliable values

## Sorting versus filtering

**Sorting** is when you arrange data into a meaningful order to make it easier to understand, analyze, and visualize. It ranks your data based on a specific metric you choose. You can sort data in spreadsheets, SQL databases (when your dataset is too large for spreadsheets), and tables in documents.

For example, if you need to rank things or create chronological lists, you can sort by ascending or descending order. If you are interested in figuring out a group’s favorite movies, you might sort by movie title to figure it out. Sorting will arrange the data in a meaningful way and give you immediate insights. Sorting also helps you to group similar data together by a classification. For movies, you could sort by genre — like action, drama, sci-fi, or romance.

**Filtering** is used when you are only interested in seeing data that meets specific criteria, and hiding the rest. Filtering is really useful when you have lots of data. You can save time by zeroing in on the data that is really important or the data that has bugs or errors. Most spreadsheets and SQL databases allow you to filter your data in a variety of ways. Filtering gives you the ability to find what you are looking for without too much effort.

For example, if you are only interested in finding out who watched movies in October, you could use a filter on the dates so only the records for movies watched in October are displayed. Then, you could check out the names of the people to figure out who watched movies in October.
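A SQL sketch of that movie example, assuming a hypothetical watch_history table (the dates are illustrative): WHERE does the filtering, ORDER BY does the sorting.

```sql
-- Keep only October views, then sort the survivors by viewer name.
SELECT viewer_name, movie_title, watch_date
FROM watch_history
WHERE watch_date BETWEEN '2023-10-01' AND '2023-10-31'
ORDER BY viewer_name ASC;
```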

**In the data analysis process, the goal of analysis is to identify trends and relationships within that data so you can accurately answer the question you’re asking.**

About the Author

I create cybersecurity notes, digital marketing notes and online courses. I also provide digital marketing consulting including but not limited to SEO, Google & Meta ads and CRM administration.
