Tuesday, August 28, 2018

SSIS - Import Mixed Data Types From MS Excel (Excel Driver IMEX=1)

MS Excel - Data Type Issue


Reading MS Excel files from SSIS is pretty intuitive, but there are also challenges while dealing with different data types coming from source files.
The way how SSIS fetches the data from Excel is through one of the available drivers, which needs to be installed on the machine running SSIS packages to get this connection working. These are the most common used drivers:

  • Microsoft Jet OLE DB 4.0: Excel versions 97-2003 (.xls extension)
  • Microsoft ACE OLEDB 12.0: Excel versions 2007- (.xlsx extension

Please, refer to Excel connection strings and the full list of drivers HERE

By default Excel driver will read first 8 rows of data and based on 8 row dataset will try to guess data type for each of the columns from the source file. That will not be an issue at all in the case each column contains well typed data accross all of the rows like integer, decimal, text. etc. 
Also, Excel driver has its limitations when detecting data types and it will follow the mapping:

  • Numeric – double-precision float (DT_R8)
  • Currency – currency (DT_CY)
  • Boolean – Boolean (DT_BOOL)
  • Date/time – datetime (DT_DATE)
  • String – Unicode string, length 255 (DT_WSTR)
  • Memo – Unicode text stream (DT_NTEXT)

In our case of mixed data types, Excel will read those 8 rows and for example, it may determine that data is type of integer, but somewhere down below in the same row the file might contain data of some other type. In that case SSIS will either fail or skip text values and import them as NULL into destination. In the case that first row contains header (property "HDR=YES" enabled) it will skip header and look for the next 8 rows.

The example that we will use in this demo shows similar scenario while determining data types and translating them according to the mapping table above:

  1. Column A - first 8 rows (2-9 after skipping the header) are numeric -> DT_R8.
  2. Column B - first 8 rows (2-9 after skipping the header) are String -> DT_WSTR.
  3. Column C - first 8 rows (2-9 after skipping the header) are number -> DT_R8. This will work for rows 2-9, but not for 12-16 where again, it will either fail the package or it will insert NULL value to the destination.

Our example will shows survey data with Response column included which may contain both, numerical and text data:




Quick Solution


Here is a quick solution and the full demo below:
1. Change Excel column type to "Text" and add IMEX extended property to your connection string to override default behavior of the driver and in that way Excel will treat data as text after adding ";IMEX=1" extended property at the end of your connection string:

ACE OLEDB:
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\ExcelFileName.xlsx;Extended Properties="EXCEL 12.0 XML;HDR=YES;IMEX=1";
JET:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\ExcelFileName.xls;Extended Properties="Excel 8.0;HDR=Yes;IMEX=1";

2. Save the file as CSV and change SSIS package to support CSV import (only if this is an option in a given business scenario) - CSV is way better in handling data types, but if changing data type is not possible in your business case refer to solution 1


Sample files


The sample files and tutorials are provided at: SSIS - Import Mixed Data Types From MS Excel 

The following software have been used:
- SQL scripts provided - any version of SQL Server
- SSIS Visual Studio solution - Visual Studio 2017, Target SQL Version 2014
- MS Excel 2007 file version (.xlsx)


Additional Troubleshooting

There are certain cases when IMEX will not work by itself and then registry change to "TypeGuessRows" key is needed. The default value for this key is 8 and it needs to be changed to 0, which will give a directive to Excel driver to scan all of the rows to determine the data type.

Depending on the driver which is installed on the target machine the key can be found at one of the following paths:
- Excel 97-2003: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel\TypeGuessRows 
- Excel 2007: HKEY_LOCAL_MACHINE\Software\Microsoft\Office\12.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows
- Excel 2010: HKEY_LOCAL_MACHINE\Software\Microsoft\Office\14.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows
- Excel 2013: HKEY_LOCAL_MACHINE\Software\Microsoft\Office\15.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows
- Excel 2016: HKEY_LOCAL_MACHINE\Software\Microsoft\Office\16.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows



In my case, Microsoft Access Database Engine 2010 Redistributable is installed and that means version 14.0.




SSIS IMEX Tutorial



  • Create test SQL database

USE master
go 
IF NOT EXISTS( SELECT * FROM sys.databases WHERE [name] like 'BI_DEV')
       CREATE DATABASE [BI_DEV]
GO
  • Create SQL landing table
USE BI_DEV
GO 
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Lnd_SurveyData]') AND type in (N'U'))
BEGIN
CREATE TABLE [dbo].[Lnd_SurveyData](
       [ResponseID] [int] NULL,
       [Question] [nvarchar](2000) NULL,
       [Response] [nvarchar](max) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
END
GO
  • Create SSIS project in Visual Studio:



  • Create Data Flow Task and name it "DF_Land_Survey_Data_Excel":





  • Create Excel data source: 



  • Specify location of the file:









  • Create destination:



  •  Connect source and destination:



  •  Select destination table and refresh metadata:




  •  Rename destination:


  •  Add table truncate before the load starts - Execute SQL Task:







  • Rename Execute SQL Task:




  • Rename the package:





  • Run and verify the load - Null values loaded incorrectly instead of expected text data


**script used in the following section:
USE BI_DEV
GO 
select from dbo.[Lnd_SurveyData]


  •  Excel - Change data type for column C from "General" to "Text"




  • Append IMEX property to connection string




  • Refresh metadata:






  • Run the package and test your result:



**script used in the following section:
USE BI_DEV
GO
select from dbo.[Lnd_SurveyData]



No comments:

Post a Comment