Importing a File in Batches
There are a million ways to import data into SQL Server. Most of the time, we want to ingest the new data as quickly and efficiently as possible, but that’s not always the case. Sometimes, we need to accept data at a rate that will not dominate resource usage on the target system or cause excessive transaction log growth. In other cases, each row from the data source needs specific server-side processing to validate and persist the data across multiple relational tables, perhaps involving foreign keys and identity columns.
All this can be achieved with client-side tools and programming. It can also be done server-side by importing the raw data into a staging table before processing using T-SQL procedures. Other times, the need arises to ingest data without using client-side tools and without making a complete copy of the raw data on the server. This article describes one possible approach in that situation.
Example Data Export
I’m going to use the Product and TransactionHistory tables from the AdventureWorks sample database. Each product and its associated transaction history will be serialized in JSON format.
To make the data extract easier, create the following helper view:
CREATE VIEW dbo.ProductTransactionExport
WITH SCHEMABINDING
AS
SELECT
    Product =
    (
        SELECT
            P.[Name],
            P.ProductNumber,
            P.Color,
            P.Size
        FOR JSON PATH
    ),
    History =
    (
        SELECT
            TH.ReferenceOrderID,
            TH.ReferenceOrderLineID,
            TH.TransactionDate,
            TH.TransactionType,
            TH.Quantity,
            TH.ActualCost
        FROM Production.TransactionHistory AS TH
        WHERE
            TH.ProductID = P.ProductID
        FOR JSON PATH
    )
FROM Production.Product AS P;
Export the data to a convenient location using bcp:
bcp AdventureWorks2022.dbo.ProductTransactionExport out
ProductTransactionExport.dat -w -q -S .\SQL2022 -T
This creates a 32.1 MB wide character export file in a second or two.
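Before running bcp, it can be worth sanity-checking the view output directly. This optional query is not part of the original process; it just confirms the JSON shape of one exported row:

```sql
-- Optional sanity check: inspect one row of the export view
SELECT TOP (1)
    PTE.Product,
    PTE.History
FROM dbo.ProductTransactionExport AS PTE;
```

The Product column should contain a JSON array with a single object; History contains an array of the related transactions, or NULL when the product has none.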
Target Tables
I’m going to import the extracted data into two new tables. The target Product table has an identity column missing from the source data. The target TransactionHistory table has a foreign key to that identity column:
CREATE TABLE dbo.Product
(
    ProductID integer IDENTITY NOT NULL PRIMARY KEY, -- new
    [Name] nvarchar(50) NOT NULL,
    ProductNumber nvarchar(25) NOT NULL,
    Color nvarchar(15) NULL,
    Size nvarchar(5) NULL
);
GO
CREATE TABLE dbo.TransactionHistory
(
    ProductID integer NOT NULL
        REFERENCES dbo.Product, -- foreign key
    ReferenceOrderID integer NOT NULL,
    ReferenceOrderLineID integer NOT NULL,
    TransactionDate datetime NOT NULL,
    TransactionType nchar(1) NOT NULL,
    Quantity integer NOT NULL,
    ActualCost money NOT NULL
);
The new tables are in the dbo schema. The original tables remain as they were in the Production schema.
Data Import
The export file contains two columns of JSON-formatted data. The first step of the import process is to create a view with a compatible structure:
CREATE VIEW dbo.ProductTransactionImport
WITH SCHEMABINDING
AS
SELECT
    Product = CONVERT(nvarchar(max), N'')
        COLLATE Latin1_General_100_BIN2,
    History = CONVERT(nvarchar(max), NULL)
        COLLATE Latin1_General_100_BIN2;
Data can’t be imported directly into this view, of course. If we try, SQL Server returns an error message:
Msg 4406, Level 16, State 1
Update or insert of view or function 'dbo.ProductTransactionImport'
failed because it contains a derived or constant field.
To insert data via the view, we need an INSTEAD OF INSERT trigger.
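If you want to reproduce the error for yourself, a direct insert attempt like the following (with made-up values) fails with Msg 4406 while no trigger exists:

```sql
-- Fails with Msg 4406: the view contains only constant/derived fields
INSERT dbo.ProductTransactionImport
    (Product, History)
VALUES
    (N'[{"Name":"Test","ProductNumber":"XX-0000"}]', NULL);
```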
The Trigger
To keep things simple to start with, the following trigger processes the import one row at a time.
For each row, the code:
- Unpacks the JSON-formatted Product data and writes it to the target table.
- If related TransactionHistory records are present:
  - Save the product identity value.
  - Unpack the JSON-formatted history records.
  - Store the history records using the saved identity value.
The code is quite straightforward:
CREATE OR ALTER TRIGGER ProductTransactionImport_IOI
ON dbo.ProductTransactionImport
INSTEAD OF INSERT
AS
IF ROWCOUNT_BIG() = 0 RETURN;

SET NOCOUNT, XACT_ABORT ON;
SET STATISTICS XML OFF;

DECLARE
    @ProductID integer,
    @Product nvarchar(max),
    @History nvarchar(max);

-- TOP (2) allows us to detect multi-row inserts efficiently
SELECT TOP (2)
    @Product = I.Product,
    @History = I.History
FROM Inserted AS I;

IF ROWCOUNT_BIG() > 1
BEGIN
    THROW 50000, 'Only single row inserts are allowed.', 1;
END;

-- Add the product record
INSERT dbo.Product
    ([Name], ProductNumber, Color, Size)
SELECT
    P.*
FROM OPENJSON(@Product COLLATE Latin1_General_100_BIN2)
WITH
(
    [Name] nvarchar(50),
    ProductNumber nvarchar(25),
    Color nvarchar(15),
    Size nvarchar(5)
) AS P;

IF @History IS NOT NULL
BEGIN
    SET @ProductID = SCOPE_IDENTITY();

    -- Add the related history rows
    INSERT dbo.TransactionHistory
    (
        ProductID,
        ReferenceOrderID,
        ReferenceOrderLineID,
        TransactionDate,
        TransactionType,
        Quantity,
        ActualCost
    )
    SELECT
        @ProductID,
        H.*
    FROM OPENJSON(@History COLLATE Latin1_General_100_BIN2)
    WITH
    (
        ReferenceOrderID integer,
        ReferenceOrderLineID integer,
        TransactionDate datetime,
        TransactionType nchar(1),
        Quantity integer,
        ActualCost money
    ) AS H;
END;
Performing the Import
We can import the file in batches using BULK INSERT. The trigger is currently only set up to handle a single row at a time, so we’ll start with single-row batches:
SET NOCOUNT ON;
SET STATISTICS XML OFF;
BULK INSERT dbo.ProductTransactionImport
FROM 'C:\Temp\ProductTransactionExport.dat'
WITH
(
    DATAFILETYPE = 'widechar',
    FIRE_TRIGGERS,
    BATCHSIZE = 1
);
The FIRE_TRIGGERS option is required to fire the INSTEAD OF INSERT trigger. Without it, the command will fail with the error message shown previously. This is a useful defence against forgetting to specify the option.
The BATCHSIZE = 1 parameter tells SQL Server to process each line of the file using a separate transaction.
The net effect is to stream the file into SQL Server, firing the import processing trigger for each import line. A hidden temporary rowset is created to support the inserted trigger alias, but it only ever contains a single row.
The BULK INSERT command above successfully imports 504 Product rows and 113,443 TransactionHistory rows in just under two seconds. Each product entry receives a new identity value, the history rows are correctly linked, and the foreign key is marked as trusted.
This solution can be expanded to allow arbitrarily complex processing of per-row data, including adding a WAITFOR delay or other throttling mechanism to deliver trickle inserts to a highly contended system.
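As a sketch of that idea, a short delay could be added at the end of the single-row trigger body. The 50 ms value below is arbitrary, chosen purely for illustration:

```sql
-- Hypothetical throttle: pause after processing each imported row.
-- Place at the end of the INSTEAD OF INSERT trigger body.
WAITFOR DELAY '00:00:00.050';
```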
Larger Batch Sizes
The single-row batch size makes for a simple trigger implementation but may add a fair amount of overhead to the task as a whole. In the example, the effect isn’t too pronounced because each import row is quite large and may contain a decent number of target history rows in the JSON array.
In other circumstances, you may prefer to use a larger batch size in the BULK INSERT statement. This means using a cursor or set-based approach inside the trigger.
The cursor implementation is a straightforward extension of the single-row code, driven from the inserted alias. Sadly, SQL Server doesn’t support a dynamic cursor on inserted, so the cursor will make a copy of the current batch of rows in tempdb. This is the second copy of the batch, since one was already created for the rowset supporting the inserted alias, which is a little unfortunate. Still, the batch will normally not be too large, and it might still be better than making a copy of the entire imported data all at once in a staging table, depending on your goals and priorities.
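A sketch of what that cursor variant might look like, with the per-row body elided since it is identical to the single-row trigger shown earlier:

```sql
-- Sketch only: cursor-driven processing inside the trigger.
-- A LOCAL STATIC cursor copies the batch to tempdb, as noted above.
DECLARE
    @Product nvarchar(max),
    @History nvarchar(max);

DECLARE Batch CURSOR LOCAL STATIC FORWARD_ONLY READ_ONLY
FOR
    SELECT I.Product, I.History
    FROM Inserted AS I;

OPEN Batch;
FETCH NEXT FROM Batch INTO @Product, @History;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- ...single-row processing from the earlier trigger goes here...
    FETCH NEXT FROM Batch INTO @Product, @History;
END;

CLOSE Batch;
DEALLOCATE Batch;
```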
A set-based approach would work the same as a single large import, which normally involves a temporary table and a MERGE statement to capture assigned identity values in a mapping table. Again, the benefit is that this processing occurs per batch rather than for the entire import all at once. I’ll show an example of this next.
Multi-row Trigger
An implementation of a multi-row trigger using MERGE and a mapping table:
-- Trigger capable of handling multi-row batches
CREATE OR ALTER TRIGGER ProductTransactionImport_IOI
ON dbo.ProductTransactionImport
INSTEAD OF INSERT
AS
IF ROWCOUNT_BIG() = 0 RETURN;

SET NOCOUNT, XACT_ABORT ON;
SET STATISTICS XML OFF;

-- Capture current batch with a surrogate key
-- This step can be skipped if the import contains
-- a *reliable* key
SELECT
    RowID = IDENTITY(integer, 1, 1),
    I.Product,
    I.History
INTO #Batch
FROM Inserted AS I;

-- Used to map surrogates to assigned identity values
CREATE TABLE #Map
(
    RowID integer NOT NULL PRIMARY KEY,
    ProductID integer NOT NULL
);

-- Add a batch of products
WITH I AS
(
    SELECT
        B.RowID,
        P.*
    FROM #Batch AS B
    CROSS APPLY OPENJSON(B.Product COLLATE Latin1_General_100_BIN2)
    WITH
    (
        [Name] nvarchar(50),
        ProductNumber nvarchar(25),
        Color nvarchar(15),
        Size nvarchar(5)
    ) AS P
)
MERGE dbo.Product AS P
USING I ON 0 = 1
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([Name], ProductNumber, Color, Size)
    VALUES (I.[Name], I.ProductNumber, I.Color, I.Size)
OUTPUT
    -- Capture surrogate to identity mapping
    I.RowID, Inserted.ProductID
    INTO #Map (RowID, ProductID);

-- Add a batch of history rows
INSERT dbo.TransactionHistory
    WITH (TABLOCK)
(
    ProductID,
    ReferenceOrderID,
    ReferenceOrderLineID,
    TransactionDate,
    TransactionType,
    Quantity,
    ActualCost
)
SELECT
    M.ProductID,
    H.*
FROM #Batch AS B
CROSS APPLY OPENJSON(B.History COLLATE Latin1_General_100_BIN2)
WITH
(
    ReferenceOrderID integer,
    ReferenceOrderLineID integer,
    TransactionDate datetime,
    TransactionType nchar(1),
    Quantity integer,
    ActualCost money
) AS H
JOIN #Map AS M
    ON M.RowID = B.RowID;
Bulk insert processing the import in batches of 100 rows:
-- Reset target tables
TRUNCATE TABLE dbo.TransactionHistory;
DELETE dbo.Product;
SET NOCOUNT ON;
SET STATISTICS XML OFF;
BULK INSERT dbo.ProductTransactionImport
FROM 'C:\Temp\ProductTransactionExport.dat'
WITH
(
    DATAFILETYPE = 'widechar',
    FIRE_TRIGGERS,
    BATCHSIZE = 100
);
This code also correctly imports and links all data from the file.
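One optional way to verify the result is to compare row counts with the source tables; the numbers should match the 504 products and 113,443 history rows mentioned earlier:

```sql
-- Optional verification: imported counts should match the source
SELECT
    SourceProducts   = (SELECT COUNT_BIG(*) FROM Production.Product),
    ImportedProducts = (SELECT COUNT_BIG(*) FROM dbo.Product),
    SourceHistory    = (SELECT COUNT_BIG(*) FROM Production.TransactionHistory),
    ImportedHistory  = (SELECT COUNT_BIG(*) FROM dbo.TransactionHistory);
```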
Clean up by dropping the new tables and views:
DROP VIEW IF EXISTS
    dbo.ProductTransactionExport,
    dbo.ProductTransactionImport;

DROP TABLE IF EXISTS
    dbo.TransactionHistory,
    dbo.Product;
Thanks for reading.