2020-Sep

SQL Datafix

Story

Consider the following relational database: FooDatabase

dbo.FooDatabase

In the context of this database a tenant can only be one of App One, App Two or User. Each User has their own unique GUID and the applications have the following GUIDs:

Application Name	GUID
App One	3ba2e2e8-f093-41d8-8daf-64d5385604ac
App Two	efec5b17-47fc-4d79-ae74-6e8b023023d5

13 333 records in dbo.item_line have been persisted incorrectly. The column dbo.item_line.tenant_id has been populated with the GUID for App One for the period 01 Jan 2018 to 31 Dec 2018 when it should have been populated with the GUID for the User tenant. This date is stored in dbo.item_line.created_date_uct.

Thankfully dbo.item_line.created_by_user_id is a mandatory field and will contain valid user GUIDs. Note that not all users will exist in dbo.tenant.

The cleanup must be idempotent, this means it can be run multiple times without changing the result beyond the initial story/task.

Task

Cleanup the data using a SQL script in a safe manner.

Identify all records in dbo.item_line that need to be updated, print the count.
Create records in dbo.tenant that dont exist when linking to dbo.item_line.created_by_user_id, print this count too.
Update dbo.item_line.tenant_id to be the newly created dbo.tenant.id when it did not exist, else update to be dbo.item_line.created_by_user_id. Print the total count of updates. This should match the first count printed.

Using the supplied data the prints and counts should be:

1
2
3

total dbo.item_line to be updated : 13333
total new dbo.tenant records : 12983
actual updates to dbo.item_line : 13333

Setup

Spin up a MS SQL Instance, you can then connect to it from Microsoft SQL Server Management Studio on localhost,1500

1	docker run --name=rosetta-df -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=Password123" -p 1500:1433 -d mcr.microsoft.com/mssql/server:2017-latest

Run these SQL scripts in the following order:

001 CREATE DATABASE.sql
002 CREATE TABLES.sql
003 SEED dbo.tenant.sql
004 SEED dbo.line_item.sql
005 SEED dbo.tenant.sql

This should get the data into a state that matches the counts above.

Solutions

WIP

Set Operations

TLDR; A set is an exactly defined collection of objects, this can be accomplished by selecting into a #temptable.

What are set operations?

SQL set operations are a set of operations that allow you to combine or compare the results of two or more SELECT statements into a single result set. The four main SQL set operations are:

UNION: The UNION operator combines the result sets of two or more SELECT statements into a single result set, eliminating any duplicates. For example, if you have two tables with the same structure and you want to combine their contents into a single table, you can use the UNION operator.

UNION ALL: The UNION ALL operator is similar to UNION, but it does not eliminate duplicates. It combines all rows from the result sets of two or more SELECT statements into a single result set.

INTERSECT: The INTERSECT operator returns the common rows between two or more SELECT statements. It returns only the rows that appear in all of the SELECT statements.

EXCEPT: The EXCEPT operator returns the rows that appear in the first SELECT statement but not in the second SELECT statement. It essentially subtracts the result set of the second SELECT statement from the result set of the first SELECT statement.

These set operations are useful for performing complex queries that involve combining or comparing data from multiple tables or views.

Cursor

TLDR; Use a STATIC cursor to iterate over the data.

What is a Cursor?

In SQL, a cursor is a database object that allows you to retrieve and manipulate a set of rows returned by a SELECT statement. A cursor is like a pointer to a specific row in the result set, which you can use to navigate through the rows one at a time and perform operations on each row.

To use a cursor in SQL, you first declare it and define the SELECT statement that it will be based on. Then you open the cursor, which causes the database to execute the SELECT statement and retrieve the result set. You can then use FETCH statements to retrieve individual rows from the result set, and use UPDATE, DELETE, or other SQL statements to modify the data.

Cursors can be useful in situations where you need to iterate through a result set and perform operations on each row individually. For example, you might use a cursor to process each row of a table one at a time, performing calculations or updates on the data as you go. Cursors can also be used in stored procedures, triggers, and other database objects to perform more complex operations on data. However, it’s important to use cursors judiciously, since they can be less efficient and slower than set-based operations in SQL.

WITH common_table_expression

TLDR; Specifies a temporary named result set, known as a common table expression (CTE).

What is a common_table_expression?

A common table expression (CTE) in SQL is a named temporary result set that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement. A CTE is defined using a WITH clause, which includes a SELECT statement that defines the result set for the CTE.

A CTE is similar to a subquery or derived table, but it has some distinct advantages. For example, a CTE can be referenced multiple times within the same query, making it easier to write complex queries that would otherwise require multiple subqueries or derived tables. Additionally, a CTE can improve query performance by allowing the database to reuse the result set for multiple parts of the query.

Here is an example of a simple CTE:

WITH my_cte AS (
    SELECT column1, column2
    FROM my_table
    WHERE column3 = 'some_value'
)
SELECT *
FROM my_cte
WHERE column2 > 100

In this example, the CTE is defined using a SELECT statement that retrieves data from the my_table table. The result set for the CTE includes only the rows where column3 is equal to ‘some_value’. The SELECT statement that references the CTE then filters the result set further to include only the rows where column2 is greater than 100.

CTEs are a powerful feature of SQL that can simplify complex queries and improve performance. They are widely supported by most modern relational database systems, including SQL Server, MySQL, Oracle, and PostgreSQL.