This article explains how to connect to Azure Data Lake Storage Gen2 and Blob Storage from Azure Databricks. For Azure Data Lake Storage Gen2 FAQs and known issues, see Azure Data Lake Storage Gen2 FAQ.
Databricks recommends using Unity Catalog external locations and Azure managed identities to connect to Azure Data Lake Storage Gen2. You can also set Spark properties to configure Azure credentials to access Azure storage. For a tutorial on connecting to Azure Data Lake Storage Gen2 with a service principal, see Tutorial: Connect to Azure Data Lake Storage Gen2.
The legacy Windows Azure Storage Blob driver (WASB) has been deprecated. ABFS has numerous benefits over WASB; see the Azure documentation on ABFS. For documentation on working with the legacy WASB driver, see Connect to Azure Blob Storage with WASB (legacy).
Azure has announced the pending retirement of Azure Data Lake Storage Gen1. Azure Databricks recommends migrating all data from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2. If you have not yet migrated, see Accessing Azure Data Lake Storage Gen1 from Azure Databricks.
Connect to Azure Data Lake Storage Gen2 with Unity Catalog
Azure Data Lake Storage Gen2 is the only Azure storage type supported by Unity Catalog.
External locations and storage credentials allow Unity Catalog to read and write data in Azure Data Lake Storage Gen2 on behalf of users. Administrators primarily use external locations to configure Unity Catalog external tables.
A storage credential is used for authentication to Azure Data Lake Storage Gen2. It can be either an Azure managed identity or a service principal. Databricks strongly recommends using an Azure managed identity. An external location is an object that combines a cloud storage path with a storage credential.
Who can manage external locations and storage credentials?
The Azure user who creates an Azure managed identity for the storage credential must:
Belong to at least one Azure Databricks workspace in your Azure tenant.
Be a Contributor or Owner of an Azure resource group.
If you create a new service principal for the storage credential, you must have the Application Administrator role or the Application.ReadWrite.All permission in Azure Active Directory.
The Azure user who grants the managed identity to the storage account must:
Be an Owner or a user with the User Access Administrator Azure RBAC role on the storage account.
The Azure Databricks user who creates the storage credential in Unity Catalog must:
Be an Azure Databricks account admin.
The Azure Databricks user who creates the external location in Unity Catalog must:
Be a metastore admin or a user with the CREATE EXTERNAL LOCATION privilege.
After you create an external location in Unity Catalog, you can grant the following permissions on it:
CREATE TABLE
READ FILES
WRITE FILES
These permissions enable Azure Databricks users to access data in Azure Data Lake Storage Gen2 without managing cloud storage credentials for authentication.
For more information, see Manage external locations and storage credentials.
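The grants listed above are issued in SQL. A minimal sketch, assuming an existing external location named my_location and a group named data_engineers (both hypothetical names), run by the location's owner or a metastore admin:

```sql
-- Hypothetical object names: my_location, data_engineers
GRANT READ FILES ON EXTERNAL LOCATION my_location TO `data_engineers`;
GRANT WRITE FILES ON EXTERNAL LOCATION my_location TO `data_engineers`;
```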
Access Azure Data Lake Storage Gen2 with Unity Catalog external locations
Use the fully qualified ABFS URI to access data secured with Unity Catalog.
Warning
Unity Catalog ignores Spark configuration settings when accessing data managed by external locations.
Examples of reading:
dbutils.fs.ls("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data")
spark.read.format("parquet").load("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data")
spark.sql("SELECT * FROM parquet.`abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data`")
Examples of writing:
dbutils.fs.mv("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data", "abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/new-location")
df.write.format("parquet").save("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/new-location")
Examples of creating external tables:
df.write.option("path", "abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/table").saveAsTable("my_table")
spark.sql("""
CREATE TABLE my_table
LOCATION "abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/table"
AS (SELECT *
FROM parquet.`abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data`)
""")
Connect to Azure Data Lake Storage Gen2 or Blob Storage using Azure credentials
The following credentials can be used to access Azure Data Lake Storage Gen2 or Blob Storage:
OAuth 2.0 with an Azure service principal: Databricks recommends using Azure service principals to connect to Azure storage. To create an Azure service principal and provide it access to Azure storage accounts, see Access storage with Azure Active Directory.
To create an Azure service principal, you must have the Application Administrator role or the Application.ReadWrite.All permission in Azure Active Directory. To assign roles on a storage account, you must be an Owner or a user with the User Access Administrator Azure RBAC role on the storage account.
Shared access signatures (SAS): You can use storage SAS tokens to access Azure storage. With SAS, you can restrict access to a storage account using temporary tokens with fine-grained access control.
You can only grant a SAS token permissions that you have on the storage account, container, or file yourself.
Account keys: You can use storage account access keys to manage access to Azure Storage. Storage account access keys provide full access to the configuration of a storage account, as well as the data. Databricks recommends using an Azure service principal or a SAS token to connect to Azure storage instead of account keys.
To view an account’s access keys, you must have the Owner, Contributor, or Storage Account Key Operator Service role on the storage account.
Databricks recommends using secret scopes for storing all credentials. You can grant users, service principals, and groups in your workspace access to read the secret scope. This protects the Azure credentials while allowing users to access Azure storage. To create a secret scope, see Secret scopes.
You can set Spark properties to configure Azure credentials to access Azure storage. The credentials can be scoped to either a cluster or a notebook. Use both cluster access control and notebook access control together to protect access to Azure storage. See Cluster access control and Workspace object access control.
Azure service principals can also be used to access Azure storage from a SQL warehouse; see Enable data access configuration.
To set Spark properties, use the following snippet in a cluster’s Spark configuration or a notebook:
Azure service principal
service_credential = dbutils.secrets.get(scope="<scope>",key="<service-credential-key>")
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
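The same properties can also be entered in a cluster's Spark config field, where secrets are referenced with the {{secrets/&lt;scope&gt;/&lt;key&gt;}} syntax instead of dbutils.secrets.get. A sketch using the same placeholders:

```
fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<scope>/<service-credential-key>}}
fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
```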
Replace
<scope> with the Databricks secret scope name.
<service-credential-key> with the name of the key containing the client secret.
<storage-account> with the name of the Azure storage account.
<application-id> with the Application (client) ID for the Azure Active Directory application.
<directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
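As a sketch, the five properties above can also be generated from the placeholder values with a small helper; the function name and shape are illustrative, not part of any Databricks API:

```python
def oauth_spark_confs(storage_account, application_id, directory_id, service_credential):
    """Build the Spark property map for OAuth access to one ADLS Gen2 account.

    In practice, service_credential comes from a secret scope, e.g.
    dbutils.secrets.get(scope="<scope>", key="<service-credential-key>").
    """
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": application_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": service_credential,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }

# On a cluster you would then apply each pair with spark.conf.set(key, value).
```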
SAS tokens
You can configure SAS tokens for multiple storage accounts in the same Spark session.
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope>", key="<sas-token-key>"))
Replace
<storage-account> with the Azure Storage account name.
<scope> with the Azure Databricks secret scope name.
<sas-token-key> with the name of the key containing the Azure storage SAS token.
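Because every property name embeds the storage account, settings for several accounts do not collide, which is why multiple accounts can share one Spark session. A minimal sketch of that idea (the helper is illustrative, not a Databricks API):

```python
def sas_spark_confs(sas_tokens):
    """Build per-account SAS Spark properties from a mapping of
    storage account name -> SAS token (fetched from a secret scope in practice)."""
    confs = {}
    for account, token in sas_tokens.items():
        suffix = f"{account}.dfs.core.windows.net"
        confs[f"fs.azure.account.auth.type.{suffix}"] = "SAS"
        confs[f"fs.azure.sas.token.provider.type.{suffix}"] = (
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider"
        )
        confs[f"fs.azure.sas.fixed.token.{suffix}"] = token
    return confs

# Two accounts, one session: apply each pair with spark.conf.set(key, value).
```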
Account key
spark.conf.set(
"fs.azure.account.key.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
Replace
<storage-account> with the Azure Storage account name.
<scope> with the Azure Databricks secret scope name.
<storage-account-access-key> with the name of the key containing the Azure storage account access key.
Access Azure storage
Once you have properly configured credentials to access your Azure storage container, you can interact with resources in the storage account using URIs. Databricks recommends using the abfss driver for greater security.
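An abfss URI always has the shape abfss://&lt;container&gt;@&lt;storage-account&gt;.dfs.core.windows.net/&lt;path&gt;. A small illustrative helper for assembling one (not part of any Databricks API):

```python
def abfss_uri(container, storage_account, path=""):
    """Build a fully qualified ABFS URI for a path inside a container."""
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
        + path.lstrip("/")
    )
```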
spark.read.load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
CREATE TABLE <database-name>.<table-name>;
COPY INTO <database-name>.<table-name>
FROM 'abfss://container@storageAccount.dfs.core.windows.net/path/to/folder'
FILEFORMAT = CSV
COPY_OPTIONS ('mergeSchema' = 'true');
Example notebook
ADLS Gen2 OAuth 2.0 with Azure service principals notebook
Get notebook
Deprecated patterns for storing and accessing data from Azure Databricks
The following are deprecated storage patterns:
Databricks no longer recommends mounting external data locations to Databricks Filesystem. See Mounting cloud object storage on Azure Databricks.
Databricks no longer recommends using credential passthrough with Azure Data Lake Storage Gen2. See Access Azure Data Lake Storage using Azure Active Directory credential passthrough (legacy).