WITH (
    --[country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
    [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2 2
    --[year] smallint,
    --[population] bigint
)
json_path = JSON path expression to column or nested property. The default path mode is lax.
In strict mode, the query fails with an error if the provided path does not exist. In lax mode, the query succeeds and the JSON path expression evaluates to NULL.
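For illustration, a minimal sketch of a query that uses JSON path expressions in the WITH clause follows; the storage path, column names, and property names are placeholders, not taken from a specific data set:

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.blob.core.windows.net/parquet/people/*.parquet',  -- placeholder path
    FORMAT = 'PARQUET')
WITH (
    [population] bigint '$.population',               -- lax (default): a missing property evaluates to NULL
    [state_name] VARCHAR (50) 'strict $.stateName'    -- strict: a missing property fails the query
) AS [r]

A complete example appears in the Specify columns using JSON paths section below.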
<bulk_options>
FIELDTERMINATOR = 'field_terminator'
Specifies the field terminator to be used. The default field terminator is a comma (",").
ROWTERMINATOR = 'row_terminator'
Specifies the row terminator to be used. If the row terminator is not specified, one of the default terminators is used. Default terminators for PARSER_VERSION = '1.0' are \r\n, \n, and \r. Default terminators for PARSER_VERSION = '2.0' are \r\n and \n.
When you use PARSER_VERSION='1.0' and specify \n (newline) as the row terminator, it will be automatically prefixed with a \r (carriage return) character, which results in a row terminator of \r\n.
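For example, a semicolon-delimited file that uses line feeds as row terminators could be read as follows; this is a sketch, and the storage path is a placeholder:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.blob.core.windows.net/csv/data.csv',  -- placeholder path
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '\n'    -- line feed; with parser 2.0 it is used as-is
) AS [r]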
ESCAPE_CHAR = 'char'
Specifies the character in the file that is used to escape itself and all delimiter values in the file. If the escape character is followed by a value other than itself, or any of the delimiter values, the escape character is dropped when reading the value.
The ESCAPE_CHAR parameter is applied regardless of whether FIELDQUOTE is enabled. It isn't used to escape the quoting character; the quoting character must be escaped with another quoting character. The quoting character can appear within a column value only if the value is enclosed with quoting characters.
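As a sketch, a file that escapes commas inside values with a backslash could be read like this; the path is a placeholder, and the option name follows the ESCAPE_CHAR spelling used above:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.blob.core.windows.net/csv/escaped.csv',  -- placeholder path
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    ESCAPE_CHAR = '\'    -- a backslash before a comma keeps the comma inside the value
) AS [r]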
FIRSTROW = 'first_row'
Specifies the number of the first row to load. The default is 1 and indicates the first row in the specified data file. The row numbers are determined by counting the row terminators. FIRSTROW is 1-based.
FIELDQUOTE = 'field_quote'
Specifies a character that is used as the quote character in the CSV file. If not specified, the quote character (") is used.
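For instance, to skip the first two lines of a file and treat single quotes as the quoting character (a sketch with a placeholder path):

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.blob.core.windows.net/csv/data.csv',  -- placeholder path
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIRSTROW = 3,         -- start reading from the third row
    FIELDQUOTE = ''''     -- single quote as the quoting character (doubled to escape it in T-SQL)
) AS [r]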
DATA_COMPRESSION = 'data_compression_method'
Specifies the compression method. Supported in PARSER_VERSION='1.0' only. The following compression method is supported: GZIP.
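As a sketch, a gzip-compressed CSV file could be read with parser 1.0 as follows; the path is a placeholder, and the codec name shown is the Hadoop Gzip codec identifier used in product samples:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.blob.core.windows.net/csv/population.csv.gz',  -- placeholder path
    FORMAT = 'CSV',
    PARSER_VERSION = '1.0',
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
) AS [r]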
PARSER_VERSION = 'parser_version'
Specifies the parser version to be used when reading files. Currently supported CSV parser versions are 1.0 and 2.0:
PARSER_VERSION = '1.0'
PARSER_VERSION = '2.0'
CSV parser version 1.0 is the default and is feature rich. Version 2.0 is built for performance and does not support all options and encodings.
CSV parser version 1.0 specifics:
The following option isn't supported: HEADER_ROW.
Default terminators are \r\n, \n, and \r.
If you specify \n (newline) as the row terminator, it will be automatically prefixed with a \r (carriage return) character, which results in a row terminator of \r\n.
CSV parser version 2.0 specifics:
Not all data types are supported.
Maximum character column length is 8000.
Maximum row size limit is 8 MB.
The following option isn't supported: DATA_COMPRESSION.
A quoted empty string ("") is interpreted as an empty string.
The DATEFORMAT SET option is not honored.
Supported format for DATE data type: YYYY-MM-DD
Supported format for TIME data type: HH:MM:SS[.fractional seconds]
Supported format for DATETIME2 data type: YYYY-MM-DD HH:MM:SS[.fractional seconds] (these formats are illustrated in the sketch after this list)
Default terminators are \r\n and \n.
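To illustrate the formats listed above, a sketch of a typed WITH clause under parser 2.0 follows; the path and column names are placeholders:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.blob.core.windows.net/csv/events.csv',  -- placeholder path
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE)
WITH (
    [event_date] date,          -- expects YYYY-MM-DD
    [event_time] time,          -- expects HH:MM:SS[.fractional seconds]
    [modified_at] datetime2     -- expects YYYY-MM-DD HH:MM:SS[.fractional seconds]
) AS [r]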
HEADER_ROW = { TRUE | FALSE }
Specifies whether a CSV file contains a header row. The default is FALSE.
Supported in PARSER_VERSION='2.0'. If TRUE, the column names are read from the first row, according to the FIRSTROW argument. If TRUE and the schema is specified using WITH, the binding of column names is done by column name, not by ordinal position.
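A minimal sketch of name-based binding with a header row; the path and column names are placeholders:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.blob.core.windows.net/csv/population.csv',  -- placeholder path
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE)
WITH (
    -- columns are matched to header names, not to ordinal positions
    [country_name] VARCHAR (100),
    [population] bigint
) AS [r]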
DATAFILETYPE = { 'char' | 'widechar' }
Specifies the encoding: char is used for UTF-8 files, widechar is used for UTF-16 files.
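For example, a UTF-16 encoded file could be read as follows; this is a sketch with a placeholder path, combined here with parser 1.0:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.blob.core.windows.net/csv/utf16/data.csv',  -- placeholder path
    FORMAT = 'CSV',
    PARSER_VERSION = '1.0',
    DATAFILETYPE = 'widechar'    -- the file is UTF-16 encoded
) AS [r]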
CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' }
Specifies the code page of the data in the data file. The default value is 65001 (UTF-8 encoding).
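As a sketch, a file saved with the Windows-1252 code page could be read like this; the path is a placeholder:

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.blob.core.windows.net/csv/legacy.csv',  -- placeholder path
    FORMAT = 'CSV',
    PARSER_VERSION = '1.0',
    CODEPAGE = '1252'    -- Windows-1252; the default is 65001 (UTF-8)
) AS [r]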
ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
This option disables the file modification check during query execution and reads files that are updated while the query is running. This is a useful option when you need to read append-only files that are appended while the query is running. In appendable files, the existing content is not updated, and only new rows are added. Therefore, the probability of wrong results is minimized compared to updateable files. This option might enable you to read frequently appended files without handling errors. See more information in the querying appendable CSV files section.
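For instance, to tolerate files that are appended while the query runs (a sketch with a placeholder path):

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.blob.core.windows.net/csv/logs/*.csv',  -- placeholder path
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
) AS [r]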
Reject Options
The rejected rows feature is in public preview.
Note that the rejected rows feature works for delimited text files and PARSER_VERSION 1.0.
You can specify reject parameters that determine how the service handles dirty records it retrieves from the external data source. A data record is considered 'dirty' if the actual data types don't match the column definitions of the external table.
When you don't specify or change reject options, the service uses default values. The service uses the reject options to determine the number of rows that can be rejected before the actual query fails. The query returns (partial) results until the reject threshold is exceeded; it then fails with the appropriate error message. A sketch combining these options follows the error file description below.
MAXERRORS = reject_value
Specifies the number of rows that can be rejected before the query fails. MAXERRORS must be an integer between 0 and 2,147,483,647.
ERRORFILE_DATA_SOURCE = data source
Specifies the data source where the rejected rows and the corresponding error file should be written.
ERRORFILE_LOCATION = Directory Location
Specifies the directory within the DATA_SOURCE, or ERRORFILE_DATA_SOURCE if specified, where the rejected rows and the corresponding error file should be written. If the specified path doesn't exist, the service creates one on your behalf. A child directory is created with the name "_rejectedrows". The "_" character ensures that the directory is escaped for other data processing unless explicitly named in the location parameter. Within this directory, there's a folder created based on the time of load submission in the format YearMonthDay_HourMinuteSecond_StatementID (for example, 20180330-173205-559EE7D2-196D-400A-806D-3BF5D007F891). You can use the statement ID to correlate the folder with the query that generated it. In this folder, two files are written: the error.json file and the data file.
The error.json file contains a JSON array with the errors encountered for rejected rows. Each element representing an error contains the following attributes:
Value: the rejected column value. If the value is larger than 100 characters, only the first 100 characters are displayed.
File: the path to the file that the row belongs to.
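A sketch that combines the reject options described above; the data source names, paths, and column definitions are placeholders, and the feature applies to delimited text files with PARSER_VERSION 1.0:

SELECT *
FROM OPENROWSET(
    BULK 'csv/population/population.csv',           -- placeholder path, relative to the data source
    DATA_SOURCE = 'my_data_source',                 -- placeholder external data source
    FORMAT = 'CSV',
    PARSER_VERSION = '1.0',
    MAXERRORS = 10,                                 -- fail the query once more than 10 rows are rejected
    ERRORFILE_DATA_SOURCE = 'my_error_data_source', -- placeholder data source for rejected rows
    ERRORFILE_LOCATION = 'rejected/population')     -- directory where error.json and the rejected data are written
WITH (
    [country_code] VARCHAR (5),
    [population] bigint
) AS [r]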
Fast delimited text parsing
There are two delimited text parser versions you can use. CSV parser version 1.0 is the default and is feature rich, while parser version 2.0 is built for performance. The performance improvement in parser 2.0 comes from advanced parsing techniques and multi-threading. The difference in speed grows as the file size grows.
Automatic schema discovery
You can easily query both CSV and Parquet files without knowing or specifying the schema by omitting the WITH clause. Column names and data types are inferred from the files.
Parquet files contain column metadata, which is read; the type mappings are described in the type mapping for Parquet section. Check reading Parquet files without specifying schema for samples.
For CSV files, column names can be read from the header row. You can specify whether a header row exists using the HEADER_ROW argument. If HEADER_ROW = FALSE, generic column names are used: C1, C2, ... Cn, where n is the number of columns in the file. Data types are inferred from the first 100 data rows. Check reading CSV files without specifying schema for samples.
Keep in mind that if you are reading a number of files at once, the schema is inferred from the first file the service gets from the storage. This can mean that some of the expected columns are omitted, because the file used by the service to define the schema did not contain these columns. In that case, use the OPENROWSET WITH clause.
Important
There are cases when the appropriate data type cannot be inferred due to a lack of information, and a larger data type is used instead. This brings performance overhead and is particularly important for character columns, which are inferred as varchar(8000). For optimal performance, check the inferred data types and use appropriate data types.
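One way to review the inferred types is sp_describe_first_result_set; a sketch, with a placeholder path:

EXEC sp_describe_first_result_set N'
    SELECT *
    FROM OPENROWSET(
        BULK ''https://mystorageaccount.blob.core.windows.net/csv/population.csv'',  -- placeholder path
        FORMAT = ''CSV'',
        PARSER_VERSION = ''2.0'',
        HEADER_ROW = TRUE) AS [r]'

The system_type_name column in the result shows the data type that would be inferred for each column.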
Type mapping for Parquet
Parquet and Delta Lake files contain type descriptions for every column. The following mappings describe how Parquet types are mapped to SQL native types:
INT64 with logical type TIMESTAMP (MILLIS / MICROS, not normalized to UTC) maps to bigint. Make sure that you explicitly adjust the bigint value with the time zone offset before converting it to a datetime value; a sketch of this adjustment follows these mappings.
INT64 with logical type TIMESTAMP (NANOS) is not supported.
Complex types map to varchar(8000), serialized into JSON.
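A sketch of the bigint adjustment mentioned above, assuming the value holds milliseconds since 1970-01-01 recorded in a local time that is two hours ahead of UTC; the variable name, sample value, and offset are placeholders:

DECLARE @ts_millis bigint = 1700000000123;  -- placeholder: milliseconds since 1970-01-01, local wall-clock time
SELECT DATEADD(HOUR, -2,                    -- placeholder offset: remove the +02:00 time zone offset
           DATEADD(MILLISECOND, CAST(@ts_millis % 1000 AS int),
               DATEADD(SECOND, CAST(@ts_millis / 1000 AS int),   -- whole seconds fit in int for dates before 2038
                   CAST('1970-01-01' AS datetime2(3))))) AS utc_datetime;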
Examples
Read CSV files without specifying schema
The following example reads a CSV file that contains a header row, without specifying column names and data types:
SELECT *
FROM OPENROWSET(
BULK 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE) as [r]
The following example reads a CSV file that doesn't contain a header row, without specifying column names and data types:
SELECT *
FROM OPENROWSET(
BULK 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0') as [r]
Read Parquet files without specifying schema
The following example returns all columns of the first row from the census data set, in Parquet format, and without specifying column names and data types:
SELECT
TOP 1 *
FROM OPENROWSET(
BULK 'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=20*/*.parquet',
FORMAT='PARQUET'
) AS [r]
Read Delta Lake files without specifying schema
The following example returns all columns of the first row from the census data set, in Delta Lake format, and without specifying column names and data types:
SELECT
TOP 1 *
FROM OPENROWSET(
BULK 'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=20*/*.parquet',
FORMAT='DELTA'
) AS [r]
Read specific columns from CSV file
The following example returns only two columns, with ordinal numbers 1 and 4, from the population*.csv files. Since there's no header row in the files, reading starts from the first line:
SELECT *
FROM OPENROWSET(
BULK 'https://sqlondemandstorage.blob.core.windows.net/csv/population/population*.csv',
FORMAT = 'CSV',
FIRSTROW = 1)
WITH (
[country_code] VARCHAR (5) COLLATE Latin1_General_BIN2 1,
[population] bigint 4
) AS [r]
Read specific columns from Parquet file
The following example returns only two columns of the first row from the census data set, in Parquet format:
SELECT
TOP 1 *
FROM OPENROWSET(
BULK 'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=20*/*.parquet',
FORMAT='PARQUET')
WITH (
[stateName] VARCHAR (50),
[population] bigint
) AS [r]
Specify columns using JSON paths
The following example shows how you can use JSON path expressions in the WITH clause and demonstrates the difference between strict and lax path modes:
SELECT
TOP 1 *
FROM OPENROWSET(
BULK 'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=20*/*.parquet',
FORMAT='PARQUET')
WITH (
--lax path mode samples
[stateName] VARCHAR (50), -- this column works because the name casing is valid; it targets the same column as the next one
[stateName_explicit_path] VARCHAR (50) '$.stateName', -- this column works because the casing in the path is valid
[COUNTYNAME] VARCHAR (50), -- this column will contain only NULLs because of the wrong casing; it targets the same column as the next one
[countyName_explicit_path] VARCHAR (50) '$.COUNTYNAME', -- this column will contain only NULLs because of the wrong casing and the default lax path mode
--strict path mode samples
[population] bigint 'strict $.population' -- this one works as column name casing is valid
--,[population2] bigint 'strict $.POPULATION' -- this one fails because of wrong casing and strict path mode
) AS [r]
Specify multiple files/folders in BULK path
The following example shows how you can use multiple file/folder paths in the BULK parameter:
SELECT
TOP 10 *
FROM OPENROWSET(
BULK (
'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=2000/*.parquet',
'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=2010/*.parquet'
),
FORMAT='PARQUET'
) AS [r]
Next steps
For more samples, see the query data storage quickstart to learn how to use OPENROWSET to read CSV, Parquet, Delta Lake, and JSON file formats. Check best practices for achieving optimal performance. You can also learn how to save the results of your query to Azure Storage using CETAS.