Exploratory Data Analysis in SparkSQL: Hands-on
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
Spark SQL is a very important and most used module that is used for structured data processing. Spark SQL allows you to query structured data using either SQL or DataFrame API.
Here is some examples of EDA using sparkSQL at Data bricks environments. Let’s get started with implementing EDA on Sample Data sets
- Create tables
- Create temporary views
- Write basic SQL queries to explore, manipulate, and present data
- Join two views and visualize the result
Working with Retail Data
For this lab, we’ll be working with a generated dataset meant to mimic data collected for online retail transactions. You can use the following path to access the data: Link
Exercise 1: Create a table
Summary: Create a new table named outdoorProducts
with the following schema:
Steps to complete:
- Make sure this notebook is idempotent by dropping any tables that have the name
outdoorProducts
- Use
csv
as the specified data source - Use the path provided above to access the data
- This data contains a header; include that in your table creation statement
DROP TABLE IF EXISTS outdoorProducts;
CREATE TABLE outdoorProducts (
invoiceNo STRING,
stockCode STRING,
description STRING,
quantity INT,
invoiceDate STRING,
unitPrice DOUBLE,
customerID STRING,
countryName STRING
) USING csv OPTIONS (
path "/mnt/training/online_retail/data-001/data.csv",
header "true"
);
Exercise 2: Explore the data
Summary: Count the number of items that have a negative quantity
This table keeps track of online transactions, including returns. Some of the quantities in the quantity
column show a negative number. Run a query that counts then number of negative values in the quantity
column.
Steps to complete:
- Write a query that reports the number of values less than 0 in the
quantity
column
SELECT
count(*)
FROM
outdoorProducts
WHERE
quantity < 0;
Here is Output
Exercise 3: Create a temporary view
Summary: Create a temporary view that includes only the specified columns and rows, and uses math to create a new column.
Steps to complete:
- Create a temporary view named
sales
- Create a new column,
totalAmount
, by multiplyingquantity
timesunitPrice
and rounding to the nearest cent - Include columns:
stockCode
,quantity
,unitPrice
,totalAmount
,countryName
- Include only rows where
quantity
is greater than 0
CREATE
OR REPLACE TEMPORARY VIEW sales AS
SELECT
stockCode,
quantity,
unitPrice,
ROUND(quantity * unitPrice, 2) AS totalAmount,
countryName
FROM
outdoorProducts
WHERE
quantity > 0;
Exercise 4: Display ordered view
Summary: Show the view you created with totalAmount
sorted greatest to least
**Steps to complete: **
- Select all columns form the view
sales
- Order the
totalAmount
column from greatest to least
SELECT
*
FROM
sales
ORDER BY
totalAmount DESC;
Exercise 5: View countries
Summary: Show a list of all unique countryName
values in the sales
view
Steps to complete:
- Write a query that returns only distinct
countryName
values
SELECT
DISTINCT(countryName)
FROM
sales
Exercise 6: Create a temporary view: salesQuants
Summary: Create a temporary view that shows total quantity
of items purchased from each countryName
Steps to complete:
- Create a temporary view named
salesQuants
- Display the sum of all
quantity
values grouped bycountryName
. Name that columntotalQuantity
- Order the view by
totalQuantity
from greatest to least
CREATE
OR REPLACE TEMP VIEW salesQuants AS
SELECT
SUM(quantity) AS totalQuantity,
countryName
FROM
sales
GROUP BY
countryName
ORDER BY
totalQuantity DESC;
SELECT * FROM salesQuants;
Exercise 7: Read in a new parquet table
Summary: Create a new table named countryCodes
.
Steps to complete:
- Drop any existing tables named
countryCodes
from your database - Use this path:
/mnt/training/countries/ISOCountryCodes/ISOCountryLookup.parquet
to create a new table using parquet as the data source. Name itcountryCodes
- Include options to indicate that there is a header for this table
DROP TABLE IF EXISTS countryCodes;
CREATE TABLE countryCodes USING parquet OPTIONS (
path "/mnt/training/countries/ISOCountryCodes/ISOCountryLookup.parquet",
header "true"
);
Exercise 8: View metadata
Summary: View column names and data types in this table.
Steps to complete:
- Use the
DESCRIBE
command to display all of column names and their data types
DESCRIBE countryCodes;
Exercise 9: Join and Visualize
Summary: Use the salesQuants
view and the countryCodes
table to display a pie chart that shows total sales by country, and identifies the country by its 3-letter id.
Steps to complete:
- Write a query that results in two columns:
totalQuantity
fromsalesQuants
andalpha3Code
fromcountryCodes
- Join
countryCodes
withsalesQuants
on the name of country listed in each table - Visualize your results as a pie chart that shows the percent of sales from each country
SELECT
totalQuantity,
countryCodes.alpha3Code AS countryAbbr
FROM
salesQuants
JOIN countryCodes ON countryCodes.EnglishShortName = salesQuants.CountryName
Sanity Check
It’s always smart to do a sanity check when manipulating and joining datasets.
- Compare your chart to the table you displayed in task #4
- Try the challenge problem to figure out what may have gone wrong
Hint:
Display a distinct list of all countries in the countryCodes table. Compare that list with the distinct list of countries from the outdoorProducts chart.
SELECT DISTINCT(EnglishShortName) FROM countryCodes ORDER BY EnglishShortName DESC;
CREATE
OR REPLACE TEMPORARY VIEW modCountryCodes AS
SELECT
alpha3code,
REPLACE (
EnglishShortName,
"United Kingdom of Great Britain and Northern Ireland",
"United Kingdom"
) AS EnglishShortName
FROM
countryCodes;
SELECT
totalQuantity,
modCountryCodes.alpha3Code AS countryAbbr
FROM
salesQuants
JOIN modCountryCodes ON modCountryCodes.EnglishShortName = salesQuants.CountryName
Note: If you notice any error in my articles, feel free to reach out and let me know of the same, and I will update the blog post accordingly.