Exploratory Data Analysis in SparkSQL: Hands-on

Huzefa Khan
5 min readMar 30, 2023

--

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

Spark SQL is a very important and most used module that is used for structured data processing. Spark SQL allows you to query structured data using either SQL or DataFrame API.
Here is some examples of EDA using sparkSQL at Data bricks environments. Let’s get started with implementing EDA on Sample Data sets

  • Create tables
  • Create temporary views
  • Write basic SQL queries to explore, manipulate, and present data
  • Join two views and visualize the result

Working with Retail Data

For this lab, we’ll be working with a generated dataset meant to mimic data collected for online retail transactions. You can use the following path to access the data: Link

Exercise 1: Create a table

Summary: Create a new table named outdoorProducts with the following schema:

Steps to complete:

  • Make sure this notebook is idempotent by dropping any tables that have the name outdoorProducts
  • Use csv as the specified data source
  • Use the path provided above to access the data
  • This data contains a header; include that in your table creation statement
DROP TABLE IF EXISTS outdoorProducts;
CREATE TABLE outdoorProducts (
invoiceNo STRING,
stockCode STRING,
description STRING,
quantity INT,
invoiceDate STRING,
unitPrice DOUBLE,
customerID STRING,
countryName STRING
) USING csv OPTIONS (
path "/mnt/training/online_retail/data-001/data.csv",
header "true"
);

Exercise 2: Explore the data

Summary: Count the number of items that have a negative quantity

This table keeps track of online transactions, including returns. Some of the quantities in the quantity column show a negative number. Run a query that counts then number of negative values in the quantity column.

Steps to complete:

  • Write a query that reports the number of values less than 0 in the quantity column
SELECT
count(*)
FROM
outdoorProducts
WHERE
quantity < 0;

Here is Output

Exercise 3: Create a temporary view

Summary: Create a temporary view that includes only the specified columns and rows, and uses math to create a new column.

Steps to complete:

  • Create a temporary view named sales
  • Create a new column, totalAmount, by multiplying quantity times unitPrice and rounding to the nearest cent
  • Include columns: stockCode, quantity, unitPrice, totalAmount, countryName
  • Include only rows where quantity is greater than 0
CREATE
OR REPLACE TEMPORARY VIEW sales AS
SELECT
stockCode,
quantity,
unitPrice,
ROUND(quantity * unitPrice, 2) AS totalAmount,
countryName
FROM
outdoorProducts
WHERE
quantity > 0;

Exercise 4: Display ordered view

Summary: Show the view you created with totalAmount sorted greatest to least

**Steps to complete: **

  • Select all columns form the view sales
  • Order the totalAmount column from greatest to least
SELECT
*
FROM
sales
ORDER BY
totalAmount DESC;

Exercise 5: View countries

Summary: Show a list of all unique countryName values in the sales view

Steps to complete:

  • Write a query that returns only distinct countryName values
SELECT
DISTINCT(countryName)
FROM
sales

Exercise 6: Create a temporary view: salesQuants

Summary: Create a temporary view that shows total quantity of items purchased from each countryName

Steps to complete:

  • Create a temporary view named salesQuants
  • Display the sum of all quantity values grouped by countryName. Name that column totalQuantity
  • Order the view by totalQuantity from greatest to least
CREATE
OR REPLACE TEMP VIEW salesQuants AS
SELECT
SUM(quantity) AS totalQuantity,
countryName
FROM
sales
GROUP BY
countryName
ORDER BY
totalQuantity DESC;

SELECT * FROM salesQuants;

Exercise 7: Read in a new parquet table

Summary: Create a new table named countryCodes.

Steps to complete:

  • Drop any existing tables named countryCodes from your database
  • Use this path: /mnt/training/countries/ISOCountryCodes/ISOCountryLookup.parquet to create a new table using parquet as the data source. Name it countryCodes
  • Include options to indicate that there is a header for this table
DROP TABLE IF EXISTS countryCodes;
CREATE TABLE countryCodes USING parquet OPTIONS (
path "/mnt/training/countries/ISOCountryCodes/ISOCountryLookup.parquet",
header "true"
);

Exercise 8: View metadata

Summary: View column names and data types in this table.

Steps to complete:

  • Use the DESCRIBE command to display all of column names and their data types
DESCRIBE countryCodes;

Exercise 9: Join and Visualize

Summary: Use the salesQuants view and the countryCodes table to display a pie chart that shows total sales by country, and identifies the country by its 3-letter id.

Steps to complete:

  • Write a query that results in two columns: totalQuantity from salesQuants and alpha3Code from countryCodes
  • Join countryCodes with salesQuants on the name of country listed in each table
  • Visualize your results as a pie chart that shows the percent of sales from each country
SELECT
totalQuantity,
countryCodes.alpha3Code AS countryAbbr
FROM
salesQuants
JOIN countryCodes ON countryCodes.EnglishShortName = salesQuants.CountryName

Sanity Check

It’s always smart to do a sanity check when manipulating and joining datasets.

  • Compare your chart to the table you displayed in task #4
  • Try the challenge problem to figure out what may have gone wrong

Hint:

Display a distinct list of all countries in the countryCodes table. Compare that list with the distinct list of countries from the outdoorProducts chart.

SELECT DISTINCT(EnglishShortName) FROM countryCodes ORDER BY EnglishShortName DESC;
CREATE
OR REPLACE TEMPORARY VIEW modCountryCodes AS
SELECT
alpha3code,
REPLACE (
EnglishShortName,
"United Kingdom of Great Britain and Northern Ireland",
"United Kingdom"
) AS EnglishShortName
FROM
countryCodes;
SELECT
totalQuantity,
modCountryCodes.alpha3Code AS countryAbbr
FROM
salesQuants
JOIN modCountryCodes ON modCountryCodes.EnglishShortName = salesQuants.CountryName

Note: If you notice any error in my articles, feel free to reach out and let me know of the same, and I will update the blog post accordingly.

--

--

Huzefa Khan
Huzefa Khan

Written by Huzefa Khan

Passionate Sr. Data Engineer with years of experience in developing and architecting high-class data solutions https://www.linkedin.com/in/huzzefakhan/

No responses yet