The Azure PostgreSQL Workshop

Welcome to the Azure PostgreSQL Workshop. In this lab, you’ll go through tasks that will help you master the basic and more advanced topics required to deploy and operate a PostgreSQL environment on Azure Database for PostgreSQL).

You can use this guide as a PostgreSQL tutorial and as study material to help you get started to learn PostgreSQL.

Theme: Deployment

Azure Database for PostgreSQL Provisioning
Data import and environment preparation
Access and basic Administration in Azure Database for PostgreSQL
- Managing databases
- Roles and Permissions

Theme: Accessibility and Business Continuity

Logical backup
Back-ups/restore
HA/DR
Patching and maintenance windows
Security Management in Azure Database for PostgreSQL

Theme: Day two operations

Monitoring and Troubleshooting in Azure Database for PostgreSQL
Deep dive into performance issue resolution and identify optimization fixes
- Multiversion Concurrency Control, MVCC
- Daemons & Tuning
- SQL characteristic
- Statistics and Queries

Prerequisites

Tools

You can use the Azure Cloud Shell accessible at https://shell.azure.com once you login with an Azure subscription. The Azure Cloud Shell has the Azure CLI pre-installed and configured to connect to your Azure subscription as well as psql and other Postgres utilities like pg_dump, createdb or createuser that will be used throughout the training, your access to the database might be through a jump-box in between cloudshell and PostgreSQL environment.

Azure subscription

If the workshop will run on your Azure subscription

Please use your username and password to login to https://portal.azure.com.

Also please authenticate your Azure CLI by running the command below on your machine and following the instructions.

az account show
az login

If the workshop will run on Azure Pass

Login with a github account with the provided link: https://azcheck.in/xxxxxxx (Please use the provided one)
Follow the instructions, basically copy the code and go to: https://www.microsoftazurepass.com/ to redeem the voucher and click on Start>.

For more information follow : https://www.microsoftazurepass.com/Home/HowTo?Length=5

Azure Cloud Shell

You can use the Azure Cloud Shell accessible at https://shell.azure.com once you login with an Azure subscription.

Head over to https://shell.azure.com and sign in with your Azure Subscription details.

Select Bash as your shell.

Select Bash

Select Show advanced settings

Select show advanced settings

Set the Storage account and File share names to your resource group name (all lowercase, without any special characters), then hit Create storage

Azure Cloud Shell

You should now have access to the Azure Cloud Shell

Set the storage account and fileshare names

Tips for uploading and editing files in Azure Cloud Shell

You can use code <file you want to edit> in Azure Cloud Shell to open the built-in text editor.
You can upload files to the Azure Cloud Shell by dragging and dropping them
You can also do a curl -o filename.ext https://file-url/filename.ext to download a file from the internet.

Workshop Overview

You will be deploying the below architecture using Bicep

Workshop diagram

Based on the workshop that is running you might be deploying the Apache Superset application which is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geo-spatial charts.that is containerized and is architected for a microservice implementation.

Getting up and running

Deploy Azure Database for PostgreSQL with Azure Portal

Azure Database for PostgreSQL is a fully-managed database as a service with built-in capabilities, such as high availability and intelligence.

Tasks

Use Bicep to deploy Azure Database for PostgreSQL - Flexible server.
Use Azure Cloud Shell to connect to DNS VM then you can access the Azure PostgreSQL database..

Setup Database

Deploy DB using Bicep

Deploy Azure Database for PostgreSQL, which is managed service that you can use to run, manage, and scale in the cloud.

Based on the previous section, the cloudshell environment should be available, log to your cloudshell - bash. the next step is to install bicep

az bicep install

Install Bicep

Download the bicep templates for the workshop

wget https://pg.azure-workshops.cloud/scripts/bicep.zip

Download Bicep Templates

Uncompress the the downloaded file

unzip bicep

Uncompress the downloaded file Templates

Create a new azure resource group to deploy the workshop resource in this resource group. Please take note of the resource group name in this below command it will be PG-Workshop.

az group create -l Eastus -n PG-Workshop

Create PG workshop resource group

In this step we will deploy the bicep template to the resource group that we created in the previous step

az deployment group create --resource-group PG-Workshop --template-file bicep/main.bicep

You will be asked for 4 questions

Admin username for the Jump-box (DNS)
Admin password for the Jump-box
Admin username for the PostgreSQL database (use your name rather than admin/root)
Admin password for the PostgreSQL database (Please use strong password)

Create PG workshop resource group

It will take a while to deploy the environment, when it finish you should see long output mentioning succeeded in the last 10 lines.

Created PG workshop resource group

After the creation process finished you should be able to see the resource in the resource groups, go to Resource Groups

Resource Groups

Click on the resource group name that we created, PG-Workshop if you didn’t change the az command to create the resource group.

Resource Groups

Visit both DNS VM to get the public IP and the PostgreSQL Flexible Server to get the access endpoint, please keep them as we will use them during the workshop.

Resource Groups

Now we move the next section to connect to the DB.

Connecting to PostgreSQL

In order to connect to a database you need to know the name of your target database, the host name and port number of the server, and what user name you want to connect as.

You can save yourself some typing by setting the environment variables PGDATABASE, PGHOST, PGPORT and/or PGUSER to appropriate values. It is also convenient to have a ~/.pgpass file to avoid regularly having to type in passwords.

Basic psql options

-d dbname
--dbname=dbname
    Specifies the name of the database to connect to. This is equivalent to specifying dbname as the first non-option argument on the command line.

-h hostname
--host=hostname
    Specifies the host name of the machine on which the server is running.

-p port
--port=port
    Specifies the TCP port or the local Unix-domain socket file extension on which the server is listening for connections. Defaults to the value of the PGPORT environment variable or, if not set, to the port specified at compile time, usually **5432**.

-U username
--username=username
    Connect to the database as the user username instead of the default. (You must have permission to do so, of course.)
-W
   --password
    Force psql to prompt for a password before connecting to a database.

Using Azure Cloud Shell

Example (please change these values to match with your setup) from the cloudshell

ssh username@<jumpbox-ip> # the DNS VM IP Address, and the username that you selected in deployment

ssh access

In the first time acessing the jumpbox make sure that you have psql installed use the following commands once you logon

sudo dnf module enable -y postgresql:13
sudo dnf install -y postgresql

You should see output like this

Install PG client

Then connect to the database

psql -U adminuser -h postgresql-db.postgres.database.azure.com postgres

Install PG client

Getting the connection string from Azure Portal

In this task, we will create a file in our Cloud Shell containing libpq environment variables that will be used to select default connection parameter values to PostgreSQL PaaS instance. These are useful to be able to connect to postgres in a fast and convenient way without hard-coding connection string.

Go to the “Connection Strings” tab on the left hand side of the Azure Portal and find psql connection string:

Open Cloud Shell and create a new .pg_azure file using your favourite editor (if you are not comfortable with Vim you can use VSCode):

Using VIM

vi .pg_azure

Add the following parameters or use the below wget command to download the file:

export PGDATABASE=postgres
export PGHOST=HOSTNAME.postgres.database.azure.com
export PGUSER=adminuser
export PGPASSWORD=your_password
export PGSSLMODE=require

Using Wget

wget https://pg.azure-workshops.cloud/scripts/pg_azure -O .pg_azure

Read the content of the file in the current session:

source .pg_azure

If you closed this bash session, you won’t be able to login again to psql without reading .pg_azure.

Let’s connect to our Azure database with psql client:

psql

You should connect to the postgres without any parameters

Task Hints You can also use the connection string shown in the Azure Portal in the Connection String tab. Using libpq variables is another option to ease your work with Postgres.

While you have the psql connected to the database, let’s run some quries:

SELECT version();

You should be able to read the PostgreSQL version.

Create table with some random data:

DROP TABLE IF EXISTS random_data;

CREATE TABLE random_data AS
SELECT s                    AS first_column,
   md5(random()::TEXT)      AS second_column,
   md5((random()/2)::TEXT)  AS third_column
FROM generate_series(1,500000) s;

Let’s select some of the records that we generated:

SELECT * FROM random_data LIMIT 10;

SELECT count(*) FROM random_data;

Data import and environment preparation

psql is a terminal-based front-end to PostgreSQL. It enables you to type in queries interactively, issue them to PostgreSQL, and see the query results.

Anything you enter in psql that begins with an unquoted backslash (\) is a psql meta-command that is processed by psql itself.

These commands make psql more useful for administration or scripting. Meta-commands are often called slash or backslash commands. The format of a psql command is the backslash, followed immediately by a command verb, then any arguments. The arguments are separated from the command verb and each other by any number of whitespace characters.

List the databases in the cluster:

psql

postgres=> \l

List the databases in the cluster with their sizes:

postgres=> \l+

Copy and paste following statements:

CREATE DATABASE quiz;
\connect quiz

CREATE TABLE public.answers (
    question_id serial NOT NULL,
    answer text NOT NULL,
    is_correct boolean NOT NULL DEFAULT FALSE
);

CREATE TABLE public.questions (
    question_id integer NOT NULL,
    question text NOT NULL
);

ALTER TABLE ONLY public.answers
    ADD CONSTRAINT answers_pkey PRIMARY KEY (question_id, answer);

ALTER TABLE ONLY public.questions
    ADD CONSTRAINT questions_pkey PRIMARY KEY (question_id);

ALTER TABLE ONLY public.answers
    ADD CONSTRAINT question_id_answers_fk FOREIGN KEY (question_id) REFERENCES public.questions(question_id);

CREATE SCHEMA calc;
CREATE OR REPLACE FUNCTION calc.increment(i integer) RETURNS integer AS $$
        BEGIN
                RETURN i + 1;
        END;
$$ LANGUAGE plpgsql;

CREATE VIEW calc.vista AS SELECT $$I'm in calc$$;

CREATE VIEW public.vista AS SELECT $$I'm in public$$;

INSERT INTO public.questions (question_id, question) VALUES (1, 'Jaki symbol chemiczny ma tlen?');

INSERT INTO public.answers (question_id, answer, is_correct) VALUES (1, 'Au', false);
INSERT INTO public.answers (question_id, answer, is_correct) VALUES (1, 'O', true);
INSERT INTO public.answers (question_id, answer, is_correct) VALUES (1, 'Oxy', false);
INSERT INTO public.answers (question_id, answer, is_correct) VALUES (1, 'Tl', false);

List the databases in the cluster:

postgres=> \l

Schema list

Lists schemas (namespaces) in the current database:

quiz=> \dn
     List of schemas
  Name  |     Owner
--------+----------------
 calc   | masteruser
 public | azure_pg_admin
(2 rows)

Check your current connection:

quiz=> \conninfo
You are connected to database "quiz" as user "masteruser" on host "psqlflexlekjqqdqpcfja.postgres.database.azure.com" (address "192.168.1.132") at port "5432".
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)

Check what’s going on in your instance:

quiz=> TABLE pg_stat_activity;

It’s unreadable, isn’t it? Change the display format:

quiz=> \x auto

And try to display the same view again:

quiz=> SELECT * FROM pg_stat_activity;

Display all relations (including tables, views, materialized views, indexes, sequences, or foreign tables) in your database:

quiz=> \d
                   List of relations
 Schema |          Name           |   Type   |  Owner
--------+-------------------------+----------+----------
 public | answers                 | table    | postgres
 public | answers_question_id_seq | sequence | postgres
 public | questions               | table    | postgres
 public | vista                   | view     | postgres
(4 rows)

Display all relations with their sizes and description:

quiz=> \d+
                                 List of relations
 Schema |          Name           |   Type   |  Owner   |    Size    | Description
--------+-------------------------+----------+----------+------------+-------------
 public | answers                 | table    | postgres | 16 kB      |
 public | answers_question_id_seq | sequence | postgres | 8192 bytes |
 public | questions               | table    | postgres | 16 kB      |
 public | vista                   | view     | postgres | 0 bytes    |
(4 rows)

Display information about one, specific table:

quiz=> \dt+ pg_class

Display tables, which names start with “pg_c[…]”

quiz=> \dt pg_c*

Check out help information:

quiz=> \?

Display the number of total connections to your database:

quiz=> SELECT count(*) FROM pg_stat_activity;

Watch the change of the count over time:

quiz=> \watch

Stop the process:

quiz=> [Ctrl+c]

List all system views:

quiz=> \dvS

Turns displaying of how long each SQL statement takes on:

quiz=> \timing

List the system functions:

quiz=> \dfS

Fetch and display the definition of the chosen function, in the form of a CREATE OR REPLACE FUNCTION command:

quiz=> \sf abs(bigint)

Print psql’s command line history:

quiz=> \s

Search for a specific command in the history:

quiz=> [Ctrl+r + part of the command string]

Basic PostgreSQL Administration

Managing PostgreSQL DB

Managing Compute and Storage

Navigate to Compute + Storage, you will be able to alter storage and compute. Also, navigate to change the backup retention.

You don’t have to change the compute size as it might have additional cost for having a bigger instance.

Compute and Storage

Managing Server Parameters

As you don’t have access to configuration files, you can change server parameters through the Azure Portal/APIs. All the changes made here provide default values for the entire cluster. Users can also make changes on the database level with ALTER DATABASE command, on the role level with ALTER ROLE command, or on the session level with the SET command.

managing parameters

In the search box type pgbouncer and change the value to TRUE:

managing parameters

Save the changes, and wait until the new deployment finish successfully:

managing parameters

Once you see the success screen, go to the VM and try to access PostgreSQL through port 6432:

psql -p 6432

managing parameters

Apply Server Locks

Navigate to Locks:

managing locks

Click on +Add, add Lock name of your choice and lock type with Delete:

managing locks

If you try to delete the server it should give the such below error:

managing locks

Roles and Permissions

Connect to the PostgreSQL instance:

[diaa@dns ~]$ psql 
psql (13.5, server 13.6)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

postgres=> 

Create new group:

postgres=> CREATE GROUP monty_python;

Create a new user Graham that belongs to monty_python group and doesn’t inherit any privileges from the group. Allow the user to have maximum 2 active connection:

postgres=> CREATE USER Graham CONNECTION LIMIT 2 IN ROLE monty_python NOINHERIT;

Create a new user Eric that belongs to monty_python group and inherits privileges from the group. Allow the user to have maximum 2 active connections:

postgres=> CREATE USER Eric CONNECTION LIMIT 2 IN ROLE monty_python INHERIT;

Display all the roles available in the cluster:

postgres=> \dg
                                        List of roles
   Role name    |                         Attributes                         |   Member of
----------------+------------------------------------------------------------+----------------
 eric           | 2 connections                                              | {monty_python}
 graham         | No inheritance                                            +| {monty_python}
                | 2 connections                                              |
 monty_python   | Cannot login                                               | {}
 postgres       | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

Connect to the quiz database:

postgres=> \c quiz

Grant all privileges for all tables in schema public to group monty_python:

quiz=> GRANT ALL ON ALL TABLES IN SCHEMA public TO monty_python;
GRANT

In order to be able to switch to another role you need to grant this permission to the current user. Replace adminuser with your pg admin user name:

GRANT graham to adminuser;
GRANT eric to adminuser;

Switch to the Graham user:

quiz=> SET ROLE TO graham;
SET

Try to check the content of answers table as user Graham:

quiz=> TABLE answers;
ERROR:  permission denied for table answers

Change the user and try to query the table again:

quiz=> SET ROLE TO eric;
SET
quiz=> table answers;
 question_id | answer | is_correct
-------------+--------+------------
           1 | Au     | f
           1 | O      | t
           1 | Oxy    | f
           1 | Tl     | f
(4 rows)

Why user Graham doesn’t have permission to view the content of answer table?:

Changing permissions

Grant the SELECT privilege on table answers to user Graham:

quiz=> GRANT SELECT ON TABLE answers TO Graham;
WARNING:  no privileges were granted for "answers"
GRANT

Switch back to the superuser account and try again:

quiz=> SET ROLE TO adminuser;
SET
quiz=> GRANT SELECT ON TABLE answers TO Graham;
GRANT

Check if user Graham is able to query the table:

quiz=> SET ROLE TO graham;
SET
quiz=> TABLE answers;
 question_id | answer | is_correct
-------------+--------+------------
           1 | Au     | f
           1 | O      | t
           1 | Oxy    | f
           1 | Tl     | f
(4 rows)

Display all granted privileges:

quiz=> \dp
                                     Access privileges
 Schema |   Name    | Type  |       Access privileges       | Column privileges | Policies
--------+-----------+-------+-------------------------------+-------------------+----------
 public | answers   | table | postgres=arwdDxt/postgres    +|                   |
        |           |       | monty_python=arwdDxt/postgres+|                   |
        |           |       | graham=r/postgres             |                   |
 public | questions | table | postgres=arwdDxt/postgres    +|                   |
        |           |       | monty_python=arwdDxt/postgres |                   |
(2 rows)

Granting roles As user Graham try to DELETE all records from table answers:

quiz=> DELETE FROM answers ;
ERROR:  permission denied for table answers

As adminuser copy all privileges from user eric to user graham:

quiz=> \c
You are now connected to database "quiz" as user "postgres".
quiz=> GRANT eric TO graham ;
GRANT ROLE

As user Graham try to DELETE all records from table answers:

quiz=> set role to graham;
SET
quiz=> DELETE FROM answers ;
ERROR:  permission denied for table answers
quiz=> SET ROLE TO adminuser;
quiz=> GRANT DELETE ON TABLE answers TO graham;
GRANT
quiz=> SET role TO Graham;
SET
quiz=> DELETE FROM answers ;

Display permissions granted to objects and information about roles:

quiz=> \dp
                                     Access privileges
 Schema |   Name    | Type  |       Access privileges       | Column privileges | Policies

--------+-----------+-------+-------------------------------+-------------------+---------
-
 public | answers   | table | postgres=arwdDxt/postgres    +|                   |
        |           |       | monty_python=arwdDxt/postgres+|                   |
        |           |       | graham=r/postgres            +|                   |
        |           |       | eric=d/postgres               |                   |
 public | questions | table | postgres=arwdDxt/postgres    +|                   |
        |           |       | monty_python=arwdDxt/postgres |                   |
(2 rows)

quiz=> \dg
                                           List of roles
   Role name    |                         Attributes                         |      Member of
----------------+------------------------------------------------------------+---------------------
 eric           | 2 connections                                              | {monty_python}
 graham         | No inheritance                                            +| {monty_python,eric}
                | 2 connections                                              |
 monty_python   | Cannot login                                               | {}
 postgres       | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

Without INHERIT, membership in another role only grants the ability to SET ROLE to that other role; the privileges of the other role are only available after having done so.

Revoke DELETE privilege from eric:

quiz=> \c
You are now connected to database "quiz" as user "adminuser".
quiz=> REVOKE DELETE ON TABLE answers FROM eric;
REVOKE
quiz=> set role to eric;
SET
quiz=> delete from answers ;
DELETE 0

As you see user Eric is still able to perform DELETE operation because of his membership in role monty_python.

Accessibility and Business Continuity

Logical Backup

Plain pg_dump

Make sure you have sourced the file with libpq variables on your current session:

source .pg_azure

pg_dump like other native postgres utilities is able to use them to connect to your database instance.

Run pg_dump against quiz database:

pg_dump quiz

You can redirect the output of pg_dump to another program:

pg_dump quiz|less

Save the output of pg_dump:

pg_dump quiz > /tmp/quiz.plain.dump

Check the content of file:

less /tmp/quiz.plain.dump

Dump only the schema (without any data):

pg_dump --schema-only quiz > /tmp/quiz_ddl.plain.dump

Check the content of file:

less /tmp/quiz_ddl.plain.dump

You don’t have to each time specify the database name, you can for instance overwrite PGDATABASE variable just for this session:

export PGDATABASE=quiz 

Let’s dump only the data:

pg_dump --data-only > /tmp/quiz_data.plain.dump

Check the content of file:

less /tmp/quiz_data.plain.dump

Change the COPY command to INSERT:

pg_dump --data-only --inserts > /tmp/quiz_data_insert.plain.dump

Check the content of file:

less /tmp/quiz_data_insert.plain.dump

Dump only one table:

pg_dump --table=answers > /tmp/answers.plain.dump

Check the content of file:

less /tmp/answers.plain.dump

Check all the options for pg_dump:

pg_dump --help

Restore from plain dump

Drop database quiz:

dropdb quiz

Recreate it:

createdb quiz

Restore it from your dump:

psql -f /tmp/quiz.plain.dump

Watch out for errors! You might want to redirect errors to a separate file:

psql -f /tmp/quiz.plain.dump 2> errors.txt
less errors.txt

Directory format

Create dump using directory format. That is the only format where you can many parallel jobs to dump your database.

pg_dump quiz -Fd -f /tmp/directorydump

Check the content of files in the directory:

zless /tmp/directorydump/*dat.gz

Restore from directory format dump

Drop database quiz:

dropdb quiz

Recreate it:

createdb quiz

Restore it from your dump:

pg_restore -d quiz /tmp/directorydump

Global objects dump

Dump logically the whole instance:

pg_dumpall > /tmp/whole_cluster.plain.dump
less /tmp/whole_cluster.plain.dump

Dump only the global objects:

pg_dumpall -g > /tmp/globals.plain.dump
less /tmp/globals.plain.dump

You will encounter errors because you don’t have access to export the passwords, if you run the following command you won’t see the errors:

pg_dumpall -g --no-role-passwords > /tmp/globals.plain.dump

Physical Backup and Point in Time Restore

Backup

While creating a server through the Azure portal, the Compute tier tab is where you select either Burstable **, **“General purpose” or Memory Optimized backups for your server. This window is also where you select the Backup Retention Period - how long (in number of days) you want the server backups stored for.

Azure backup

The backup retention period governs how far back in time a point-in-time restore can be retrieved, since it’s based on backups available. Point-in-time restore is described further in the following section.

Azure backup

After changing the retention period make sure to click Save

When you see the deployment finished, it means that retention period has changed

Azure backup

Point-in-time restore

Azure Database for PostgreSQL Flexible server allows you to restore the server back to a point-in-time and into to a new copy of the server. You can use this new server to recover your data, or have your client applications point to this new server.

For example, if a table was accidentally dropped at noon today, you could restore to the time just before noon and retrieve the missing table and data from that new copy of the server. Point-in-time restore is at the server level, not at the database level.

The following steps restore the sample server to a point-in-time:

In the Azure portal, select your Azure Database for PostgreSQL Flexible server.
In the toolbar of the server’s Overview page, select Restore.

Azure backup

The new server created by point-in-time restore has the same server admin login name and password that was valid for the existing server at the point-in-time chose. You can change the password from the new server’s Overview page.

The new server created during a restore does not have the firewall rules or VNet service endpoints that existed on the original server. These rules need to be set up separately for this new server.

Database replication

Native logical replication

Logical replication uses the terms ‘publisher’ and ‘subscriber’.

The publisher is the PostgreSQL database you’re sending data from.
The subscriber is the PostgreSQL database you’re sending data to.

On the CloudShell run the following command to create new flexible server

az postgres flexible-server create --vnet spoke-vnet --subnet subnet-02 --resource-group PG-Workshop \
  --name replication-flex-$$ --admin-user replica --admin-password 'PkG3zk&SKt' \
  --sku-name Standard_B1ms --tier Burstable --storage-size 128 \
  --tags "key=replica" --version 13 --high-availability Disabled

Azure Replication

While the the the replication server being deployed, you can continue working on the primary database to change the parameters:

Go to server parameters page on the portal.
Set the server parameter wal_level to logical.
Update max_worker_processes parameter value to at least 16. Otherwise, you may run into issues like WARNING: out of background worker slots.
Save the changes and restart the server to apply the wal_level change.
Confirm that your PostgreSQL instance allows network traffic from your connecting resource.
Grant the admin user replication permissions.

ALTER ROLE <adminname> WITH REPLICATION;

The adminname is the PostgreSQL user that we used while creating the database in first section.

While on the primary database, we should create the publication

\c quiz
CREATE PUBLICATION answers_pub FOR TABLE  answers;

Now. we can check if the database replica has finished, the output should be like the following the picture, Azure Replication Output

Once the replica server deployed, we need to change the Private DNS to private.postgresql.database.azure.com

Azure Replication Output

Notice the name of the PostgreSQL name it should be “replication-flex-<Random Number>.postgres.database.azure.com”

export PGPASSWORD='PkG3zk&SKt'; psql -d postgres  -U replica  -h replication-flex-<123>>.postgres.database.azure.com

Azure replica

Once connected, create the DDL of the database:

CREATE DATABASE quiz;
\connect quiz

CREATE TABLE public.answers (
    question_id serial NOT NULL,
    answer text NOT NULL,
    is_correct boolean NOT NULL DEFAULT FALSE
);

Azure backup

Now, we want to create the connection, make sure that you change the parameters to match your environment.

CREATE SUBSCRIPTION sub CONNECTION 'host=192.168.1.132 user=diaa dbname=quiz password=@n6DnfN&P' PUBLICATION answers_pub;

Check the answers table, the table shouldn’t be 0 rows :

table answers;

Azure backup

HA/DR

High Availability

Azure Database for PostgreSQL - Flexible Server offers high availability configuration with automatic failover capability using zone redundant server deployment. When deployed in a zone redundant configuration, flexible server automatically provisions and manages a standby replica in a different availability zone. Using PostgreSQL streaming replication, the data is replicated to the standby replica server in synchronous mode.

Tasks

Configure Azure Database for PostgreSQL - Flexible Server for High Availability

Click the high availability tab to configure high availability.

Azure backup

Ensure to check the box and save your changes. This will trigger a high availability deployment.

Azure backup

You will see a confirmation pop up, click yes.

Azure backup

You will reveive a message like this once deployment is complete.

Azure backup

Execute a Forced Failover

From Azure portal, click on the forced failover button to initiate a forced failover to secondary.

Azure backup

You will see a message like this once failover is complete.

Azure backup

Once the process is complete, check if the primary and secondary zones have interchanged.

Azure backup

Execute a planned failover

From Azure portal, click on the planned failover button to initiate a planned failover to secondary.

Azure backup

You will see a message like this once failover is complete.

Azure backup

Disable High Availability

To disable high available, navigate to high availablity tab.

Azure backup

Uncheck the availability option and click save.

Azure backup

Azure Backup

Backup and restore in Azure Database for PostgreSQL - Flexible Server

Azure Database for PostgreSQL - Flexible Server automatically performs regular backups of your server. You can then do a point-in-time recovery (PITR) as seen in the previous section, within a retention period that you specify.

Tasks:

Manual backup: Backing up database

You can manually take a backup by using the PostgreSQL tool pg_dump and pg_restore.

SSH into to the jumpbox vm.

ssh username@<jumpbox-ip

Execute below command to take a back up of the database.

pg_dump -Fc -v --host=<host> --username=<name> --dbname=<database name> -f <database>.dump

Restore manual dabase backup using pg_restore

pg_restore -v --no-owner --host=<server name> --port=<port> --username=<user-name> --dbname=<target database name> <database>.dump

You can read how to optimize the migration process here

Patching and maintenance windows

Scheduled maintenance in Azure Database for PostgreSQL – Flexible server

Azure Database for PostgreSQL - Flexible server performs periodic maintenance to keep your managed database secure, stable, and up-to-date. During maintenance, the server gets new features, updates, and patches. You can configure custom maintenance schedules for your server.

Tasks:

Selecting a custom maintenance window

Navigate to the maintenance tab in Azure portal for your flexible server.

Azure backup

Select a custom schedule from the options available in the drop down.

Azure backup

Click on save schedule to complete the configuration.

Azure backup

Security Management PostgreSQL

Installing pgAudit extension

Audit logging of database activities in Azure Database for PostgreSQL - Flexible server is available through the PostgreSQL Audit extension: pgAudit. pgAudit provides detailed session and/or object audit logging. To enable pgAudit on Azure Database for PostgreSQL - Flexible Server please follow the steps below.

On the sidebar, select Server Parameters and type extension in the search field.

pgAuditWhitelistExtensions

Hit Save

Once deployment is done go back to the Server Parameters and search for shared_preload_libraries and choose pgAudit.

pgAuditWhitelistExtensions

Hit Save and then choose Save and Restart.

At this point you have the extension installed and you can go ahead and CREATE EXTENSION is the database. Connect to your server using a client (like psql) and enable the pgAudit extension.

CREATE EXTENSION pgaudit;

Once you have enabled pgAudit, you can configure its parameters to start logging. To configure pgAudit you can follow below instructions. Using the Azure portal: On the sidebar, select Server Parameters and search for pgaudit. Change the pgaudit.log from NONE to ALL:

pgAuditWhitelistExtensions

From now on all the actions in your database will be traced.

Viewing audit logs

The way you access the logs depends on which endpoint you choose. We have configured storage account and mounted it into the VM.

Open two cloud shell terminals, one will be needed to run some SQL commands (psql) and on the second we will observe log generation. From psql run some queries:

CREATE TABLE a(id int);
INSERT into a SELECT generate_series(1,1000);

From the second terminal configure a mounted container location so you can have the storage account mounted on your filesystem:

Mounting Storage Account to VM

In this section you will mount Storage Account to your dns VM to be able to easier manipulate on log files.

First let’s download and install necessary packages. Feel free to simply copy and paste the following commands:

sudo rpm -Uvh https://packages.microsoft.com/config/rhel/8/packages-microsoft-prod.rpm
sudo dnf -y install blobfuse
sudo mkdir /mnt/ramdisk
sudo mount -t tmpfs -o size=16g tmpfs /mnt/ramdisk
sudo mkdir /mnt/ramdisk/blobfusetmp
sudo chown <your VM admin> /mnt/ramdisk/blobfusetmp

Authorize access to your storage account

You can authorize access to your storage account by using the account access key, a shared access signature, a managed identity, or a service principal. Authorization information can be provided on the command line, in a config file, or in environment variables.

For this exercise we will authorize with the account access keys and storing them in a config file. The config file should have the following format:

accountName myaccount
accountKey storageaccesskey
containerName insights-logs-postgresqllogs

Please prepare the following file in editor of your choice. Values for the accountName and accountKey you will find in the Azure Portal. Please navigate to your storage account in the portal and then choose Access keys page:

Server Parameters

Copy accountName and accountKey and paste it to the file. Copy the content of your file and paste it to the fuse_connection.cfg file in your home directory, then mount your storage account container onto the directory in your VM:

vi fuse_connection.cfg
chmod 600 fuse_connection.cfg
mkdir ~/mycontainer
sudo blobfuse ~/mycontainer --tmp-path=/mnt/resource/blobfusetmp  --config-file=/home/<your VM admin>/fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120

sudo -i
cd /home/<your VM admin>/mycontainer/
ls # check if you see mounted container
# Please use tab key for directory autocompletion; do not copy and paste!
cd resourceId\=/SUBSCRIPTIONS/<your subscription id>/RESOURCEGROUPS/PG-WORKSHOP/PROVIDERS/MICROSOFT.DBFORPOSTGRESQL/FLEXIBLESERVERS/PSQLFLEXIKHLYQLERJGTM/y\=2022/m\=06/d\=16/h\=09/m\=00/
ls
less

and output appended data as the log file grows with the tail command:

tail -f

After a minute or two (please expect a slight delay) you will see your commands being registered by pgAudit. You should see the following lines:

{ "properties": {"timestamp": "2022-05-23 08:23:59.526 UTC","processId": 9300,"errorLevel": "LOG","sqlerrcode": "00000","message": "2022-05-23 08:23:59 UTC 9300 5-1 db-pgbench,user-pgadmin,app-psql,client-192.168.0.4 LOG:  AUDIT: SESSION,1,1,DDL,CREATE TABLE,TABLE,public.a,CREATE TABLE a(id int);,<not logged>"}, "resourceId": "/SUBSCRIPTIONS/***/RESOURCEGROUPS/PG-WORKSHOP/PROVIDERS/MICROSOFT.DBFORPOSTGRESQL/FLEXIBLESERVERS/PSQLFLEXIKHLYQLERJGTM", "category": "PostgreSQLLogs", "operationName": "LogEvent"}
{ "properties": {"timestamp": "2022-05-23 08:24:00.511 UTC","processId": 9300,"errorLevel": "LOG","sqlerrcode": "00000","message": "2022-05-23 08:24:00 UTC 9300 9-1 db-pgbench,user-pgadmin,app-psql,client-192.168.0.4 LOG:  AUDIT: SESSION,2,1,WRITE,INSERT,,,\"INSERT into a SELECT generate_series(1,1000);\",<not logged>"}, "resourceId": "/SUBSCRIPTIONS/***/RESOURCEGROUPS/PG-WORKSHOP/PROVIDERS/MICROSOFT.DBFORPOSTGRESQL/FLEXIBLESERVERS/PSQLFLEXIKHLYQLERJGTM", "category": "PostgreSQLLogs", "operationName": "LogEvent"}

Day Two Operations

Configure PgBadger

Configuring Server Parameters

Navigate to Server Parameters page in the Azure Portal and modify the following parameters:

log_line_prefix = '%t %p %l-1 db-%d,user-%u,app-%a,client-%h '  #Please mind the space at the end!
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
log_min_duration_statement=0

You can type part of the name in the search field to find them quicker:

Server Parameters

After the change hit the “Save”:

Save changed parameters

Configuring Diagnostic Settings

Navigate to Diagnostic settings page in the Azure Portal and add a new one with Storage Account destination:

Server Parameters

Choose a name of your choice for your setting, redirect the logs to the storage account (“Archive to a storage account” checkbox) and choose “PostgreSQLLogs” as a category:

Server Parameters

Hit save button.

Mounting Storage Account to VM

If you finished the pgaudit secion in the previous section you can skip to Install PgBadger

In this section you will mount Storage Account to your dns VM to be able to easier manipulate on log files.

First let’s download and install necessary packages. Feel free to simply copy and paste the following commands:

sudo rpm -Uvh https://packages.microsoft.com/config/rhel/8/packages-microsoft-prod.rpm
sudo dnf -y install blobfuse
sudo mkdir /mnt/ramdisk
sudo mount -t tmpfs -o size=16g tmpfs /mnt/ramdisk
sudo mkdir /mnt/ramdisk/blobfusetmp
sudo chown <your VM admin> /mnt/ramdisk/blobfusetmp

Authorize access to your storage account

For this exercise we will authorize with the account access keys and storing them in a config file. The config file should have the following format:

accountName myaccount
accountKey storageaccesskey
containerName insights-logs-postgresqllogs

Server Parameters

vi fuse_connection.cfg
chmod 600 fuse_connection.cfg
mkdir ~/mycontainer
sudo blobfuse ~/mycontainer --tmp-path=/mnt/resource/blobfusetmp  --config-file=/home/<your VM admin>/fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120

sudo -i
cd /home/<your VM admin>/mycontainer/
ls # check if you see mounted container
# Please use tab key for directory autocompletion; do not copy and paste!
cd resourceId\=/SUBSCRIPTIONS/<your subscription id>/RESOURCEGROUPS/PG-WORKSHOP/PROVIDERS/MICROSOFT.DBFORPOSTGRESQL/FLEXIBLESERVERS/PSQLFLEXIKHLYQLERJGTM/y\=2022/m\=06/d\=16/h\=09/m\=00/
ls
less PT1H.json 

You should see some logs being generated.

Install PgBadger

As the last step we need to install PgBadger. Feel free to copy and paste the following commands:

sudo -i
dnf install -y perl perl-devel
wget https://github.com/darold/pgbadger/archive/v11.8.tar.gz
tar xzf v11.8.tar.gz
cd pgbadger-11.8/
perl Makefile.PL
make && make install

Generate pgbadger report

Choose the file you want to generate pgBadger from and go to the directory where the chosen PT1H.json file is stored, for instance:

cd /home/pgadmin/mycontainer/resourceId=/SUBSCRIPTIONS/***/RESOURCEGROUPS/PG-WORKSHOP/PROVIDERS/MICROSOFT.DBFORPOSTGRESQL/FLEXIBLESERVERS/PSQLFLEXIKHLYQLERJGTM/y=2022/m=05/d=23/h=09/m=00

and run the following commands to extract the message value from json:

cut -f9- -d\: PT1H.json | cut -f1 -d\} | sed -e 's/^."//' -e 's/"$//' > 01
cat 01| sed 's/\\n/\n/g'>02
cat 02| sed 's/\\"/"/g'>03

You are ready to generate your first pgBadger report:

/usr/local/bin/pgbadger --prefix='%t %p %l-1 db-%d,user-%u,app-%a,client-%h ' 03 -o pgbadgerReport.html

Now you can download your report either from Azure Portal or by using scp command:

Server Parameters

Multiversion Concurrency Control, MVCC

This section describes the behavior of the PostgreSQL database system when two or more sessions try to access the same data at the same time. The goals in that situation are to allow efficient access for all sessions while maintaining strict data integrity. Every developer of database applications and DBA should be familiar with the topics covered in this chapter.

Task Hints

We will use the portal to change the PostgreSQL parameters.
Inspect the table size and see the impact of autovacuum.

Tasks

Inspect and change server paramerters

You can list, show, and update configuration parameters for an Azure Database for PostgreSQL server through the Azure portal.

Go to the Azure PostgreSQL resource

Go PostgreSQL server parameteres

Change Autovacuum server parameter to off and Save

Go PostgreSQL server parameteres

On psql or your favorite IDE/GUI

Check Autovacuum setting:

SHOW AUTOVACUUM ;

It should return this:

  postgres=> SHOW AUTOVACUUM ;
  autovacuum
  ------------
  off
  (1 row)

  postgres=>

Make sure that you have the random_data table

drop TABLE IF EXISTS random_data;

CREATE TABLE random_data AS
SELECT s                    AS first_column,
   md5(random()::TEXT)      AS second_column,
   md5((random()/2)::TEXT)  AS third_column
FROM generate_series(1,500000) s;

In case you wanted to see the path of the table:

SELECT pg_relation_filepath('random_data');

Output

  pg_relation_filepath
  ----------------------
  base/14417/16507
  (1 row)

Display the table size:

SELECT pg_relation_size('random_data');

Output

  postgres=> SELECT pg_relation_size('random_data');
  pg_relation_size
  ------------------
          50569216
  (1 row)

Display the table size in MB:

SELECT pg_size_pretty(pg_relation_size('random_data'));

Output

  postgres=> SELECT pg_size_pretty(pg_relation_size('random_data'));
  pg_size_pretty
  ----------------
  48 MB
  (1 row)

Inserting more records in the random_data

INSERT INTO random_data
SELECT s,
       md5(random() :: TEXT),
       md5((random() / 2) :: TEXT)
FROM generate_series(1, 1000000) s;

SELECT count(*) FROM random_data;

Output

  postgres=> INSERT INTO random_data
  postgres-> SELECT s,
  postgres->        md5(random() :: TEXT),
  postgres->        md5((random() / 2) :: TEXT)
  postgres-> FROM generate_series(1, 1000000) s;
  INSERT 0 1000000
  postgres=>
  postgres=> SELECT count(*) FROM random_data;
    count
  ---------
  1500000
  (1 row)  

Check the table size after the insersion

SELECT pg_size_pretty(pg_relation_size('random_data'));

Output

  postgres=> SELECT pg_size_pretty(pg_relation_size('random_data'));
  pg_size_pretty
  ----------------
  145 MB
  (1 row)

  postgres=>

Notice, that there is a signigcant increase in the table size. Now, delete all the recrods:

DELETE from random_data;

Output

  postgres=> DELETE from random_data;
  DELETE 1500000

Checking the table size after deletion of the records:

SELECT pg_size_pretty(pg_relation_size('random_data'));

Output

  postgres=> SELECT pg_size_pretty(pg_relation_size('random_data'));
  pg_size_pretty
  ----------------
  145 MB
  (1 row)  

You can see that the table size is still 145 MB, if we started the VACUUM process things would change:

VACUUM VERBOSE random_data;

Output

postgres=> VACUUM VERBOSE random_data;
INFO:  vacuuming "public.random_data"
INFO:  "random_data": removed 1500000 row versions in 18519 pages
INFO:  "random_data": found 1500000 removable, 0 nonremovable row versions in 18519 out of 18519 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 595
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.25 s, system: 0.00 s, elapsed: 0.23 s.
INFO:  "random_data": truncated 18519 to 0 pages
DETAIL:  CPU: user: 0.06 s, system: 0.01 s, elapsed: 0.04 s
INFO:  vacuuming "pg_toast.pg_toast_16507"
INFO:  index "pg_toast_16507_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "pg_toast_16507": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 596
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
postgres=> SELECT pg_size_pretty(pg_relation_size('random_data'));
pg_size_pretty
----------------
0 bytes
(1 row)

Check the table size after applying the Vacuum process:

 SELECT pg_size_pretty(pg_relation_size('random_data'));

Output

  postgres=> SELECT pg_size_pretty(pg_relation_size('random_data'));
  pg_size_pretty
  ----------------
  0 bytes
  (1 row)

SQL Characteristic

Partial Index

Let’s create a new table in quiz database and load it with some data:

CREATE SCHEMA IF NOT EXISTS games;
SET SEARCH_PATH TO games;
DROP TABLE IF EXISTS games;
CREATE TABLE games
(
    id         BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    start_time TIMESTAMP NOT NULL DEFAULT now(),
    end_time   TIMESTAMP NOT NULL,
    players    INTEGER[] NOT NULL,
    winner     INTEGER,
    is_activ   BOOLEAN            DEFAULT TRUE
);

INSERT INTO games (start_time, end_time, players, winner, is_activ)
SELECT d, d + '1 hour', ('{' || n || ',' || n + 4 || '}')::INT[], n, TRUE
FROM generate_series(1, 10000) n,
     generate_series('2021-01-01'::TIMESTAMP, now()::TIMESTAMP, '1 day'::INTERVAL) d
;

Check how many rows were added:

SELECT count(*) FROM games;

Update some of the rows so is_activ field will have value FALSE for all the rows where start_time is different than today:

UPDATE games SET is_activ = FALSE WHERE start_time::date <> CURRENT_DATE;

Check how many of the rows have flag is_activ set to TRUE:

SELECT count(*) FROM games WHERE is_activ;

Run the same query 4-5 times again and check it’s execution time:

SELECT count(*) FROM games WHERE is_activ;

Were subsequent executions faster than the first and could you explain why?

Let’s create a partial index using is_activ column:

CREATE INDEX ON games(id) WHERE is_activ;

Check again how much time this query needs to be executed:

SELECT count(*) FROM games WHERE is_activ;

JSONB

Create a table containing JSONB column and populate it with some data:

CREATE TABLE products
(
    name       TEXT,
    attributes JSONB
);

INSERT INTO products (name, attributes)
VALUES ('Leffe Blonde Ale',
        '{
          "category": "Golden / Blonde Ale",
          "region": "Belgium",
          "ABV": 6.6
        }');

Query the table using JSON operators:

SELECT name, attributes -> 'ABV'
FROM products;

SELECT name, attributes ->> 'ABV'
FROM products
WHERE attributes ->> 'region' = 'Belgium';

Create the GIN index on the table:

CREATE INDEX ON products USING gin(attributes);

Statistics and Query Planning

EXPLAIN

Recreate the random_data table:

DROP TABLE IF EXISTS random_data;

CREATE TABLE random_data
AS
SELECT s                           AS first_column,
       md5(random() :: TEXT)       AS second_column,
       md5((random() / 2) :: TEXT) AS third_column
FROM generate_series(1, 500000) s;

Run EXPLAIN command to see what’s the execution plan of ‘SELECT *’ query:

EXPLAIN TABLE random_data;

Output:

Seq Scan on random_data  (cost=0.00..11420.05 rows=524705 width=68)

Check the statistics that Postgres currently has for random_data table:

SELECT relpages, reltuples FROM pg_class WHERE relname = 'random_data';

Output:

 relpages | reltuples
----------+-----------
        0 |         0
(1 row)

Run the VACUUM ANALYZE command against random_data and check the statisctics again:

VACUUM ANALYSE random_data;

SELECT relpages, reltuples FROM pg_class WHERE relname = 'random_data';

Output:

 relpages | reltuples
----------+-----------
     6173 |    500000
(1 row)

Check the EXPLAIN output again:

EXPLAIN TABLE random_data;

Are the numbers more accurate?

Output:

                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on random_data  (cost=0.00..11173.00 rows=500000 width=70)
(1 row)

Let’s calculate the cost as Postgres query planner does:

SELECT relpages * current_setting('seq_page_cost')::numeric
           + reltuples * current_setting('cpu_tuple_cost')::numeric
FROM pg_class
WHERE relname = 'random_data';

Output:

 ?column?
----------
    11173
(1 row)

As you can see that’s the same cost as shown in the EXPLAIN output;

Let’s add a WHERE condition to the query:

EXPLAIN SELECT * FROM random_data WHERE first_column < 2000;

Output:

Gather  (cost=1000.00..9971.17 rows=1940 width=70)
  Workers Planned: 2
  ->  Parallel Seq Scan on random_data  (cost=0.00..8777.17 rows=808 width=70)
        Filter: (first_column < 2000)

Let’s add ANALYZE to EXPLAIN clause:

EXPLAIN ANALYSE SELECT * FROM random_data WHERE first_column < 2000;

Output:

Gather  (cost=1000.00..9971.17 rows=1940 width=70) (actual time=0.498..727.918 rows=1999 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on random_data  (cost=0.00..8777.17 rows=808 width=70) (actual time=0.004..83.350 rows=666 loops=3)
        Filter: (first_column < 2000)
        Rows Removed by Filter: 166000
Planning Time: 0.098 ms
Execution Time: 728.032 ms

Now not only the plan was shown but also the query was executed.

Create an index and see how the execution plan has changed:

CREATE INDEX ON random_data(first_column);

EXPLAIN ANALYZE SELECT * FROM random_data WHERE first_column < 2000;

Output:

Index Scan using random_data_first_column_idx on random_data  (cost=0.42..86.67 rows=1957 width=70) (actual time=0.012..0.330 rows=1999 loops=1)
  Index Cond: (first_column < 2000)
Planning Time: 0.107 ms
Execution Time: 0.421 ms

Why Index Scan not Index Only Scan was used?

See the execution plan for a selfjoin:

EXPLAIN ANALYZE
SELECT t5.*
FROM random_data
         JOIN random_data t5 USING (first_column)
WHERE t5.first_column < 2000;

Output:

Nested Loop  (cost=0.84..5543.32 rows=1957 width=70) (actual time=0.025..3.809 rows=1999 loops=1)
  ->  Index Scan using random_data_first_column_idx on random_data t5  (cost=0.42..86.67 rows=1957 width=70) (actual time=0.016..0.422 rows=1999 loops=1)
        Index Cond: (first_column < 2000)
  ->  Index Only Scan using random_data_first_column_idx on random_data  (cost=0.42..2.78 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1999)
        Index Cond: (first_column = t5.first_column)
        Heap Fetches: 0
Planning Time: 0.331 ms
Execution Time: 3.928 ms

Why Nested Loop was used?

Check if planner has chosen the right plan by disabling the nested loop:

SET ENABLE_NESTLOOP TO OFF;

EXPLAIN ANALYZE
SELECT t5.*
FROM random_data
         JOIN random_data t5 USING (first_column)
WHERE t5.first_column < 2000;

Output:

Gather  (cost=1111.13..10352.56 rows=1957 width=70) (actual time=0.965..114.698 rows=1999 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Hash Join  (cost=111.13..9156.86 rows=815 width=70) (actual time=35.316..67.464 rows=666 loops=3)
        Hash Cond: (random_data.first_column = t5.first_column)
        ->  Parallel Seq Scan on random_data  (cost=0.00..8256.33 rows=208333 width=4) (actual time=0.011..38.918 rows=166667 loops=3)
        ->  Hash  (cost=86.67..86.67 rows=1957 width=70) (actual time=0.786..0.787 rows=1999 loops=3)
              Buckets: 2048  Batches: 1  Memory Usage: 216kB
              ->  Index Scan using random_data_first_column_idx on random_data t5  (cost=0.42..86.67 rows=1957 width=70) (actual time=0.038..0.442 rows=1999 loops=3)
                    Index Cond: (first_column < 2000)
Planning Time: 0.219 ms
Execution Time: 114.823 ms

Disable also Hash Joins:

SET ENABLE_HASHJOIN TO OFF;

EXPLAIN ANALYZE
SELECT t5.*
FROM random_data
         JOIN random_data t5 USING (first_column)
WHERE t5.first_column < 2000;

Output:

Gather  (cost=1000.85..11896.00 rows=1957 width=70) (actual time=0.446..287.762 rows=1999 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Merge Join  (cost=0.84..10700.30 rows=815 width=70) (actual time=13.313..16.500 rows=666 loops=3)
        Merge Cond: (random_data.first_column = t5.first_column)
        ->  Parallel Index Only Scan using random_data_first_column_idx on random_data  (cost=0.42..10079.76 rows=208333 width=4) (actual time=12.803..12.873 rows=667 loops=3)
              Heap Fetches: 0
        ->  Index Scan using random_data_first_column_idx on random_data t5  (cost=0.42..86.67 rows=1957 width=70) (actual time=0.085..0.552 rows=1999 loops=3)
              Index Cond: (first_column < 2000)
Planning Time: 0.255 ms
Execution Time: 287.901 ms

Which algorithm was the fastest for this query and why?

work_mem Setting

Run EXPLAIN command to see what’s the execution plan of the SELECT query that requires sorting:

EXPLAIN ANALYSE SELECT second_column FROM random_data ORDER BY 1 DESC;

Output:

Sort  (cost=73824.53..75136.30 rows=524705 width=32) (actual time=10705.361..14201.555 rows=500000 loops=1)
  Sort Key: second_column DESC
  Sort Method: external merge  Disk: 21096kB
  ->  Seq Scan on random_data  (cost=0.00..11420.05 rows=524705 width=32) (actual time=0.018..808.240 rows=500000 loops=1)
Planning Time: 0.087 ms
Execution Time: 14378.488 ms

As you see external merge was used as a sort method. It means that your data were sorted on the disk. This is because work_mem value was to small to sort the data in memory.

Check the current value of work_mem:

SHOW work_mem;

How much memory do you need to sort the data in RAM?

You can change the work_mem value just for the session to try it out. Set the proper value and try to rerun the query.

SET work_mem = '10MB';

After chosing the right work_mem value you will see that execution plan has changed:

Sort  (cost=61270.03..62581.80 rows=524705 width=32) (actual time=11884.004..12146.057 rows=500000 loops=1)
  Sort Key: second_column DESC
  Sort Method: quicksort  Memory: 51351kB
  ->  Seq Scan on random_data  (cost=0.00..11420.05 rows=524705 width=32) (actual time=0.013..579.455 rows=500000 loops=1)
Planning Time: 0.051 ms
Execution Time: 12206.274 ms

Now quicksort Memory was used instead of external merge. Why more memory is needed for the same sort operation in memory than on the disk?

Clean up

Once you’re done with the workshop, make sure to delete the resources you created. You can read through manage Azure resources by using the Azure portal or manage Azure resources by using Azure CLI for more details.

Contributors

The following people have contributed to this workshop, thanks!

@AwdotiaRomanowna

@diaa

@damaccar

@pamir

@akumaranchath

@orenbouni