Basic Queries with Spark SQL

Huzefa Khan
Jan 2, 2021 · 4 min read


As you will notice, Spark SQL syntax and patterns are largely the same as the SQL you would use in other modern database systems.

Create table

We are going to be working with different files (and file formats) throughout this course. The first thing we need to do, in order to access the data through our SQL interface, is create a table from that data.

A Databricks table is a collection of structured data. We will use Spark SQL to query tables.

Let's take an example: this table contains 10 million fictitious records that hold facts about people, like first and last names, date of birth, salary, etc. We're using the Parquet file format, which is commonly used in many big data workloads.

DROP TABLE IF EXISTS People10M;
CREATE TABLE People10M
USING parquet
OPTIONS (
path "/mnt/training/dataframes/people-10m.parquet",
header "true");

Querying tables

Start with a simple SELECT statement to get a view of the data.

SELECT * FROM People10M;

View the schema for this table by using the DESCRIBE command.

DESCRIBE People10M;

Displaying query results

Any query that starts with a SELECT statement automatically displays the results below. We can use a WHERE clause to limit the results to those that meet a given condition or set of conditions.

SELECT
firstName,
middleName,
lastName,
birthDate
FROM
People10M
WHERE
year(birthDate) > 1990
AND gender = 'F'

Math

Spark SQL includes many built-in functions that are also used in standard SQL. We can use them to create new columns based on a rule.

SELECT MOD(4,0.011);
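To illustrate the point about creating new columns from a rule, here is a hypothetical sketch (not from the original course) that uses the built-in ROUND function and an arithmetic expression to derive a monthly salary column from the salary column we saw earlier:

```sql
-- Hypothetical example: derive a rounded monthly salary
-- from the existing salary column in People10M.
SELECT
  firstName,
  salary,
  ROUND(salary / 12, 2) AS monthlySalary
FROM
  People10M
LIMIT 10;
```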

Temporary Views

So far, you’ve been working with Spark SQL by querying a table that we defined for you. In the following exercises, we will work with temporary views. Temporary views are useful for data exploration. A temporary view gives you a name to query from SQL but, unlike a table, does not persist when you restart the cluster or switch to a new notebook. Temporary views also will not show up in the `Data` tab.

In the cell below, we create a temporary view that holds a subset of the People10M columns, along with a derived savings column.

CREATE OR REPLACE TEMPORARY VIEW PeopleSavings AS
SELECT
firstName,
lastName,
year(birthDate) as birthYear,
salary,
salary * 0.2 AS savings
FROM
People10M;

Query Views

For the most part, you can query a view exactly as you would query a table. The query below uses the built-in function AVG() to calculate avgSalary grouped by birthYear. This is an aggregate function, which means it performs a calculation on a set of values. You must include a GROUP BY clause to identify the subset of values you want to summarize.

The final clause, ORDER BY, declares the column that will control the order in which the rows appear, and the keyword DESC means they will appear in descending order.

SELECT
birthYear,
ROUND(AVG(salary), 2) AS avgSalary
FROM
PeopleSavings
GROUP BY
birthYear
ORDER BY
avgSalary DESC

Define a new table

Now we will show you how to create a table using Parquet. Parquet is an open-source, column-based file format. Apache Spark supports many different file formats; you can specify how you want your table to be written with the USING keyword.

For now, we will focus on the commands we will use to create a new table.

This data contains information about the relative popularity of first names in the United States by year from 1880–2016.

Line 1: Tables must have unique names. By including the DROP TABLE IF EXISTS command, we are ensuring that the next line (CREATE TABLE) can run successfully even if this table has already been created. The semi-colon at the end of the line allows us to run another command in the same cell.

Line 2: Creates a table named ssaNames, defines the data source type (parquet), and indicates that there are some optional parameters to follow.

Line 3: Identifies the path to the file in object storage.

Line 4: Indicates that the first line of the table should be treated as a header.

DROP TABLE IF EXISTS ssaNames;
CREATE TABLE ssaNames USING parquet OPTIONS (
path "/mnt/training/ssn/names.parquet",
header "true"
)

Preview the data

Run the cell below to preview the data. Notice that the LIMIT keyword restricts the number of returned rows to the specified limit.

SELECT
*
FROM
ssaNames
LIMIT
5;

Count distinct values

First, we can query a table to count the distinct values in a given field.

SELECT count(DISTINCT firstName)
FROM SSANames;

Create temporary views

Next, we create two temporary views, each holding the distinct first names from one of the tables.

CREATE OR REPLACE TEMPORARY VIEW SSADistinctNames AS 
SELECT DISTINCT firstName AS ssaFirstName
FROM SSANames;

CREATE OR REPLACE TEMPORARY VIEW PeopleDistinctNames AS
SELECT DISTINCT firstName
FROM People10M

Perform join

Now, we can use the view names to join the two data sets. If you are new to SQL, you may want to learn more about the different types of joins you can perform; the Wikipedia article on SQL joins offers complete explanations, with diagrams and sample SQL code.

SELECT firstName 
FROM PeopleDistinctNames
JOIN SSADistinctNames ON firstName = ssaFirstName
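The join above keeps only names that appear in both views. As a sketch of another join type, Spark SQL also supports LEFT ANTI JOIN, which returns rows from the left side that have no match on the right. For example (a hypothetical variation, assuming the same views as above), to find first names in People10M that do not appear in the SSA data:

```sql
-- Hypothetical sketch: names present in People10M but absent from the SSA data.
SELECT firstName
FROM PeopleDistinctNames
LEFT ANTI JOIN SSADistinctNames ON firstName = ssaFirstName
```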

Reference:

https://docs.databricks.com/data/index.html
