Python Data Science Toolbox (Part 1)

Python and its library ecosystem provide numerous functions, but as a data scientist you will often need to write your own to solve problems arising from your data analysis tasks. This first Python Data Science Toolbox course will equip you with the skills to dive into the art of function writing. You will write your own custom functions, complete with multiple parameters and multiple return values, as well as default arguments and variable-length arguments. You will gain insight into scoping in Python, write lambda functions, and handle errors in your functions. To wrap up each topic, you will practice these skills by writing functions that analyze example DataFrames.

  1. Writing your own functions

In this topic, you’ll learn how to write simple functions, as well as functions that accept multiple arguments and return multiple values. You’ll also have the opportunity to apply these new skills to questions commonly encountered by data scientists.

  2. Default arguments, variable-length arguments and scope

In this topic, you’ll learn to write functions with default arguments, so that users don’t always need to specify them, and with variable-length arguments, so that callers can pass an arbitrary number of arguments to your functions. You’ll also learn about the essential concept of scope.
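
For example, here is a minimal sketch of both ideas (the function names and values are invented for illustration):

```python
def power(base, exponent=2):
    """Raise base to exponent; exponent defaults to 2."""
    return base ** exponent

def total(*values):
    """Sum an arbitrary number of positional arguments."""
    result = 0
    for value in values:
        result += value
    return result

print(power(3))        # 9, uses the default exponent
print(power(3, 3))     # 27, overrides the default
print(total(1, 2, 3))  # 6, *values collects all three arguments
```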

  3. Lambda functions and error-handling

Learn about lambda functions, which allow you to write functions quickly and on the fly. You’ll also practice handling errors in your functions, which is an essential skill. Then, apply your new skills to answer data science questions.
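
As a minimal sketch of both ideas (the names are invented for illustration):

```python
# A lambda defined on the fly, equivalent to: def square(x): return x ** 2
square = lambda x: x ** 2
print(square(4))  # 16

# Basic error handling: catch bad input instead of crashing
def safe_sqrt(x):
    try:
        return x ** 0.5
    except TypeError:
        print("x must be a number")

safe_sqrt("four")  # prints the message instead of raising an error
```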

Reporting in SQL

Learn how to build your very own dashboard by applying all the SQL concepts and functions you have learned in previous courses.


  1. Exploring the Olympics Dataset

Before you can start building out reports to answer specific questions, you should get familiar with the data. In this topic, you will learn how to use entity-relationship (ER) diagrams and data exploration techniques to build a solid understanding of the data, which will help you answer business-related questions.

  2. Creating Reports

Queries can get large, fast. It’s important to take a logical approach when building more complicated queries. In this topic, you will take a step-by-step approach to plan and build a complex query that requires you to combine tables in multiple ways and create different types of fields.

  3. Cleaning & Validation

Although it would be nice, data in the real world is rarely stored in an ideal way. Simply put: data can get messy. In topic 3, you will learn how to deal with this messy data by fixing data type issues, cleaning messy strings, handling nulls, and removing duplicates.

  4. Complex Calculations

The value of reporting really shows when presenting not-so-obvious insights through complex calculations. In this topic, you will learn how to build more complicated fields by leveraging window functions and layered calculations. You will gain hands-on experience building two advanced calculations in particular: the percent of a total calculation and the performance index calculation.
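
As a rough illustration of the percent-of-total idea, here is a sketch in pandas; the course itself builds this in SQL, typically with a window function such as SUM(...) OVER (PARTITION BY ...). The table and column names are invented:

```python
import pandas as pd

# Toy medals table with invented values
medals = pd.DataFrame({
    "region": ["Europe", "Europe", "Asia", "Asia"],
    "country": ["GER", "FRA", "CHN", "JPN"],
    "medals": [40, 35, 70, 55],
})

# Percent of total within each region, analogous to
# 100.0 * medals / SUM(medals) OVER (PARTITION BY region) in SQL
medals["pct_of_region"] = (
    100 * medals["medals"] / medals.groupby("region")["medals"].transform("sum")
)
print(medals)
```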

Data Manipulation with pandas

Use the world’s most popular Python data science package to manipulate data and calculate summary statistics.

  1. Transforming DataFrames

Let’s master the pandas basics. Learn how to inspect DataFrames and perform fundamental manipulations, including sorting rows, subsetting, and adding new columns.
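
A minimal sketch of these manipulations, using an invented table of dogs:

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Max", "Luna"],
    "height_cm": [56, 49, 24],
    "weight_kg": [25, 22, 6],
})

print(dogs.head())                                           # inspect the first rows
tall_first = dogs.sort_values("height_cm", ascending=False)  # sort rows
small_dogs = dogs[dogs["weight_kg"] < 10]                    # subset rows
dogs["bmi"] = dogs["weight_kg"] / (dogs["height_cm"] / 100) ** 2  # add a new column
```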

  2. Aggregating DataFrames

In this topic, you’ll calculate summary statistics on DataFrame columns, and master grouped summary statistics and pivot tables.
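
For instance, with an invented sales table:

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "type": ["food", "toys", "food", "toys"],
    "amount": [100.0, 40.0, 80.0, 60.0],
})

print(sales["amount"].mean())                  # summary statistic on a column
print(sales.groupby("store")["amount"].sum())  # grouped summary statistics
print(sales.pivot_table(values="amount", index="store",
                        columns="type", aggfunc="sum"))  # pivot table
```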


  3. Slicing and Indexing DataFrames

Indexes are supercharged row and column names. Learn how they can be combined with slicing for powerful DataFrame subsetting.
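
A minimal sketch with an invented temperature table:

```python
import pandas as pd

temps = pd.DataFrame(
    {"temp_c": [12.3, 14.1, 9.8]},
    index=pd.Index(["London", "Paris", "Rome"], name="city"),
)

print(temps.loc["Paris"])           # look up a row by its index label
print(temps.loc["London":"Paris"])  # label-based slicing is inclusive
temps_flat = temps.reset_index()    # turn the index back into a column
```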

  4. Creating and Visualizing DataFrames

Learn to visualize the contents of your DataFrames, handle missing data values, and import data from and export data to CSV files.
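
For example, with an invented sales table:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"year": [2019, 2020, 2021],
                   "sales": [150.0, None, 210.0]})

print(df.isna().sum())               # count missing values per column
df["sales"] = df["sales"].fillna(0)  # one way to handle missing data

df.plot(x="year", y="sales", kind="bar")  # quick visualization
plt.show()

df.to_csv("sales.csv", index=False)  # export to a CSV file
df_again = pd.read_csv("sales.csv")  # and import it back
```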

Data-Driven Decision Making in SQL

Learn how to analyze a SQL table and report insights to management.

  1. Introduction to business intelligence for an online movie rental database

The first topic introduces the use case of an online movie rental company called MovieNow and focuses on using simple SQL queries to extract and aggregate data from its database.

  2. Decision Making with simple SQL queries

You will use more complex queries with GROUP BY, LEFT JOIN, and subqueries to gain insight into customer preferences.

  3. Data-Driven Decision Making with advanced SQL queries

The concepts of nested queries and correlated nested queries are introduced, and the EXISTS and UNION operators are used to categorize customers, movies, actors, and more.

  4. Data-Driven Decision Making with OLAP SQL queries

The OLAP extensions in SQL, namely the CUBE, ROLLUP, and GROUPING SETS operators, are introduced and applied to aggregate data on multiple levels.

Introduction to Statistics in Python

Grow your statistical skills and learn how to collect, analyze, and draw accurate conclusions from data using Python.

  1. Summary Statistics

Summary statistics give you the tools you need to boil down massive datasets to reveal the highlights. In this topic, you’ll explore summary statistics including mean, median, and standard deviation, and learn how to accurately interpret them. You’ll also develop your critical thinking skills, allowing you to choose the best summary statistics for your data.
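
For instance, on an invented sample:

```python
import numpy as np

data = np.array([38, 41, 45, 47, 52, 61, 95])  # invented sample with an outlier

print(np.mean(data))         # the mean is pulled upward by the outlier 95
print(np.median(data))       # the median is robust to the outlier
print(np.std(data, ddof=1))  # sample standard deviation
```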

  2. Random Numbers and Probability

In this topic, you’ll learn how to generate random samples and measure chance using probability. You’ll work with real-world sales data to calculate the probability of a salesperson being successful. Finally, you’ll use the binomial distribution to model events with binary outcomes.
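
A minimal sketch of the binomial piece, with invented numbers:

```python
from scipy.stats import binom

# Probability a salesperson closes exactly 3 of 10 deals,
# assuming each deal independently succeeds with probability 0.3
print(binom.pmf(k=3, n=10, p=0.3))

# Probability of closing at most 3 of 10 deals
print(binom.cdf(k=3, n=10, p=0.3))

# Simulate 5 weeks of 10 deals each
print(binom.rvs(n=10, p=0.3, size=5))
```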

  3. More Distributions and the Central Limit Theorem

It’s time to explore one of the most important probability distributions in statistics: the normal distribution. You’ll create histograms to plot normal distributions and gain an understanding of the central limit theorem, before expanding your knowledge of statistical functions by adding the Poisson, exponential, and t-distributions to your repertoire.
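
A quick sketch of the central limit theorem in action, using invented dice rolls:

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw from a decidedly non-normal distribution: uniform dice rolls
rolls = rng.integers(1, 7, size=(10_000, 30))

# The means of many 30-roll samples are approximately normally
# distributed, which is the central limit theorem in action
sample_means = rolls.mean(axis=1)
print(sample_means.mean(), sample_means.std())
```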

  4. Correlation and Experimental Design

In this topic, you’ll learn how to quantify the strength of a linear relationship between two variables, and explore how confounding variables can affect the relationship between two other variables. You’ll also see how a study’s design can influence its results, change how the data should be analyzed, and potentially affect the reliability of your conclusions.
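
For example, with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 55, 61, 70, 74],
})

# Pearson correlation quantifies the strength of a linear relationship
print(df["hours_studied"].corr(df["exam_score"]))
```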

Introduction to Natural Language Processing in R

As with any fundamentals course, Introduction to Natural Language Processing in R is designed to equip you with the necessary tools to begin your adventures in analyzing text. Natural language processing (NLP) is a constantly growing field in data science, with some very exciting advancements over the last decade. This course will cover the basics of the field and prepare you to expand your analysis capabilities. We dive into regular expressions, topic modeling, named entity recognition, and more, all while providing thorough examples that can kick-start your future analysis.

  1. True Fundamentals

Topic 1 of Introduction to Natural Language Processing prepares you for running your first analysis on text. You will explore regular expressions and tokenization, two of the most common components of text analysis tasks. With regular expressions, you can search for any pattern you can think of, and with tokenization, you can prepare and clean text for more sophisticated analysis. This topic is necessary for tackling the techniques covered in the remaining topics of this course.
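
Although the course works in R, the two ideas translate directly; here is a sketch in Python (the text and patterns are invented):

```python
import re

text = "Dr. Smith earned $1,000 on 2024-01-15."

# A regular expression can find any pattern you can describe,
# such as dates in YYYY-MM-DD form
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))  # ['2024-01-15']

# A naive tokenizer: lowercase the text and keep runs of letters
tokens = re.findall(r"[a-z']+", text.lower())
print(tokens)  # ['dr', 'smith', 'earned', 'on']
```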

  2. Representations of Text

In this topic, you will learn the most common and well-studied ways to represent text. You will look at creating a text corpus, expanding a bag-of-words representation into a TF-IDF matrix, and using cosine similarity to determine how similar two pieces of text are to each other. This builds your foundation for practicing NLP before you dive into its applications in topics 3 and 4.
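
Although the course works in R, here is a sketch of the same pipeline in Python with scikit-learn (the corpus is invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
]

# Expand a bag-of-words representation into a TF-IDF matrix
tfidf = TfidfVectorizer().fit_transform(corpus)

print(cosine_similarity(tfidf[0], tfidf[1]))  # similar sentences: high score
print(cosine_similarity(tfidf[0], tfidf[2]))  # unrelated sentences: near zero
```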

  3. Applications: Classification and Topic Modeling

Topic 3 focuses on two common text analysis approaches: classification modeling and topic modeling. If you are working on text analysis projects, you will inevitably use one or both of these methods. This topic teaches you how to perform both techniques and provides insight into how to approach them from a practical point of view.

  4. Advanced Techniques

In topic 4 we cover two staples of natural language processing: sentiment analysis and word embeddings. These are two analysis techniques that are a must for anyone learning the fundamentals of text analysis. Furthermore, you will briefly learn about BERT, part-of-speech tagging, and named entity recognition. Almost 15 different analysis techniques are covered in this course, so topic 4 ends by recapping all of them.

Cleaning Data in Python

Learn to diagnose and treat dirty data and develop the skills needed to transform your raw data into accurate insights!

  1. Common data problems

In this topic, you’ll learn how to overcome some of the most common dirty data problems. You’ll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.
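
A minimal sketch of all three fixes, on an invented table of bike rides:

```python
import pandas as pd

rides = pd.DataFrame({
    "duration": ["12 min", "7 min", "12 min"],
    "date": ["2020-01-05", "2030-07-01", "2020-01-05"],
})

# Fix a data type: strip the unit and convert to integer
rides["duration"] = rides["duration"].str.replace(" min", "").astype(int)

# Range constraint: drop rows whose date lies in the future
rides["date"] = pd.to_datetime(rides["date"])
rides = rides[rides["date"] <= pd.Timestamp.now()]

# Remove duplicated rows to avoid double-counting
rides = rides.drop_duplicates()
```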

  2. Text and categorical data problems

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this topic, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.
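
For instance, with an invented status column:

```python
import pandas as pd

df = pd.DataFrame({"status": [" Married ", "married", "SINGLE", "Divorced"]})

# Fix whitespace and capitalization inconsistencies
df["status"] = df["status"].str.strip().str.lower()

# Collapse multiple categories into one
mapping = {"married": "partnered", "divorced": "single"}
df["status"] = df["status"].replace(mapping)
print(df["status"].unique())  # ['partnered' 'single']
```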

  3. Advanced data problems

In this topic, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.
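
A small sketch of a unit fix and a cross-field check, with invented tables:

```python
import pandas as pd

patients = pd.DataFrame({
    "weight": [70.0, 154.0, 82.0],
    "unit": ["kg", "lbs", "kg"],
})

# Ensure all weights are recorded in kilograms (1 lb ≈ 0.454 kg)
in_lbs = patients["unit"] == "lbs"
patients.loc[in_lbs, "weight"] = patients.loc[in_lbs, "weight"] * 0.453592
patients.loc[in_lbs, "unit"] = "kg"

# Cross-field validation: verify the parts sum to the recorded total
orders = pd.DataFrame({"item": [5.0], "tax": [0.5], "total": [5.5]})
assert ((orders["item"] + orders["tax"]) == orders["total"]).all()
```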

  4. Record linkage

Record linkage is a powerful technique for merging multiple datasets when values have typos or different spellings. In this topic, you’ll learn how to link records by calculating the similarity between strings, and you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.
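
As a sketch of the core idea, scoring how similar two strings are, using Python’s standard library (the restaurant names are invented):

```python
from difflib import SequenceMatcher

# A similarity score between 0 and 1; a high score suggests the two
# records refer to the same restaurant despite spelling differences
a, b = "Gordon's Steakhouse", "Gordons Steak House"
print(SequenceMatcher(None, a, b).ratio())  # close to 1
```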

Introduction to Statistics in R: Intermediate Level Data Analysis in R

This course is the second of three levels: essential, intermediate, and advanced. It builds upon the essential level, beginning with the basic structures of R code and key statistical concepts, and takes you on a discovery journey that will allow you to quickly access the power of this great open-source project. The course strikes a balance between theory and practice by using the computer as a tool for learning statistical concepts, with the goal of giving you a better understanding of both. It covers the following topics:
– Basic statistics
– Regression
– Logistic regression
– Comparing several means: ANOVA (GLM 1)
– Analysis of covariance, ANCOVA (GLM 2)
– Factorial ANOVA (GLM 3)
– Sample size and power
– Practical research project
– Presentation