Databricks

Data Integrity Suite

  • Pipeline engine name: Enter a meaningful name for the pipeline engine.
  • Type: Specifies Databricks as the datasource type for this pipeline engine.
  • Datasource: Lists the available datasources for the selected datasource type.
  • Connection: Lists the available connections for the selected datasource.
  • Job processing instance pool: Specifies the Databricks instance pool in the Databricks environment to process jobs on the specified connection.
  • Fileshare path: Specifies the path where artifacts are downloaded during pipeline execution. Files are downloaded during the initial execution of a pipeline engine; a file is not downloaded again if it already exists in the fileshare location.
  • Enrich datasets catalog (optional): Specifies the name of the data share catalog in the Databricks environment. Enter the exact data share catalog name so that these datasets can be accessed when running a pipeline that includes an Enrich step. This gives you the flexibility to keep the data share under a single, recognizable name.
  • Cluster Configuration
    • Auto-Scale: Specify the minimum and maximum number of worker nodes to use for the workload. When selected, Databricks automatically adds or removes worker nodes to match the current demand for compute resources.
    • Single Node Cluster: When selected, all tasks and computations are performed on a single machine. Single Node Clusters are often used for lightweight workloads where high performance is not a primary concern.
  • Advanced Settings
    • Cluster logs path: Sets the location of cluster logs, which help you troubleshoot unexpected issues. You can set a different path or keep it the same as the fileshare path.
    • Spark properties: This option allows you to specify Spark properties that determine the compute, memory, and disk resources allocated to Databricks batch workloads.
    • Click Add Property to add a key-value pair.
    • Key: Specifies the Spark property, such as spark.<name-of-property>.
    • Value: Specifies a valid value for the property, such as 3.

    For example, adding the key spark.executor.memory with the value 4g sets the executor memory to 4 GB.

    For more information, see Spark configuration in the Databricks documentation.
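
As a sketch of how the Key/Value pairs entered in Advanced Settings correspond to Spark properties (the property name and value are taken from the example above; everything else here is illustrative):

```python
# Sketch only: the dict mirrors what you would enter in the Key/Value fields.
spark_properties = {
    "spark.executor.memory": "4g",  # Key / Value pair from the example above
}

# In a Databricks notebook or job, the same property could be applied when
# building a SparkSession (shown as a comment so this sketch stays
# dependency-free):
#
#   spark = (SparkSession.builder
#            .config("spark.executor.memory", "4g")
#            .getOrCreate())

for key, value in spark_properties.items():
    print(f"{key} = {value}")
```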

Suggested spark properties to reduce disk usage

Consider setting the following Spark properties to help minimize disk requirements during Databricks profile executions:
  • spark.shuffle.compress: Set to true
  • spark.shuffle.spill.compress: Set to true
  • spark.io.compression.codec: Use zstd
  • spark.io.compression.zstd.level: Set to 3

Enabling these properties ensures that data written to disk during Spark jobs is compressed, effectively reducing overall disk usage.
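
The recommendations above can be expressed as a single property mapping; a minimal sketch (the values mirror the list, and each pair would be entered in the Spark properties Key/Value fields):

```python
# Suggested disk-saving properties from the list above, as Key/Value pairs.
disk_saving_properties = {
    "spark.shuffle.compress": "true",        # compress shuffle map output
    "spark.shuffle.spill.compress": "true",  # compress data spilled during shuffles
    "spark.io.compression.codec": "zstd",    # codec used for shuffle/spill compression
    "spark.io.compression.zstd.level": "3",  # zstd compression level
}

for key, value in disk_saving_properties.items():
    print(f"{key} = {value}")
```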

Before using "Enrich datasets catalog" for Databricks

Note: To subscribe to data or to create data shares in Data Integrity Suite workspace, contact your Precisely support representative. Depending on the subscribed platform, customers can send an email to the Databricks Partnership (databricks.partnership@precisely.com) or to the Snowflake Partnership (snowflake.partnership@precisely.com) to provision subscribed data. For more information about how to view data shares in the Databricks environment, see Read data shared using Databricks-to-Databricks Delta Sharing (Databricks documentation). For more information about how to view data shares in the Snowflake environment, see Data Consumers (Snowflake documentation).
  1. Set up data share: Ensure you have set up a data share that contains the datasets you intend to use for data enrichment. This share should include all the relevant datasets required for the Enrich step.
  2. Create catalog from a data share: If you are using the data share for the first time, you must create a catalog from it within your workspace. To do this, navigate to the workspace's Data > Delta Sharing > Shared with me section and locate the data share containing the enrich datasets. Click Create catalog for the data share. This creates a catalog that includes the share within your workspace, making it accessible for future enrichment steps.
  3. Name your data share: While creating the catalog, you can provide a name for it. This name helps you identify the specific dataset collection associated with the Enrich step.
  4. Access the catalog: Once the data share catalog is created, you can access the datasets within your workspace's Data section. The datasets will be organized under the catalog name you provided earlier.
  5. Grant permissions to the user for the datasets: After the dataset is available in the Data section, you must authorize access to these datasets for specific users.
    1. Select the dataset for which you want to grant permissions.
    2. On the Permissions tab, select Grant.
    3. In the Principals field, type and select the name of the user to whom you want to grant permissions.
    4. Select the checkboxes SELECT, USE CATALOG, and USE SCHEMA and click Grant.
      Note: You must grant the above permissions to the same user or role that is used in the connection for cataloging and creating datasets.
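
The grants in step 5 can also be issued with Databricks SQL. A minimal sketch, assuming a catalog named enrich_catalog, a schema named reference_data, and a placeholder user, all of which are illustrative values to replace with your own:

```python
# Hypothetical names: substitute your own catalog, schema, and principal.
catalog = "enrich_catalog"
schema = "reference_data"
user = "user@example.com"

# SQL equivalents of the SELECT, USE CATALOG, and USE SCHEMA grants in step 5.
grant_statements = [
    f"GRANT USE CATALOG ON CATALOG {catalog} TO `{user}`",
    f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{user}`",
    f"GRANT SELECT ON SCHEMA {catalog}.{schema} TO `{user}`",
]

# In a Databricks notebook, each statement would be run with spark.sql(stmt);
# printing here keeps the sketch dependency-free.
for stmt in grant_statements:
    print(stmt)
```

As the note above states, grant these privileges to the same user or role that the connection uses for cataloging and creating datasets.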