Best practices for configuring Databricks pools

This section outlines recommended strategies and settings for optimizing Databricks pool configurations, focusing on cost efficiency, scalability, and reliable job execution in data profiling and processing scenarios.

The scenarios below list the recommended configuration for each setting, followed by an explanation.

Idle instance auto termination
Configuration: Set to 2 minutes.
Explanation: Ensures that idle instances in the pool terminate quickly, minimizing compute costs.
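
As a rough illustration only, the sketch below creates a Databricks instance pool with a 2-minute idle auto-termination window through the Instance Pools REST API (/api/2.0/instance-pools/create). The pool name, node type, and credential handling are assumptions; adjust them for your workspace and cloud.

import os
import requests

# Assumes DATABRICKS_HOST (for example, https://<workspace>.cloud.databricks.com)
# and DATABRICKS_TOKEN are available in the environment.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

payload = {
    "instance_pool_name": "dis-profiling-pool",    # illustrative name
    "node_type_id": "r6id.xlarge",                 # see "Instance type selection" below
    "idle_instance_autotermination_minutes": 2,    # terminate idle instances after 2 minutes
}

response = requests.post(
    f"{host}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
response.raise_for_status()
print(response.json())   # returns the new instance_pool_id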

Auto scaling local storage
Configuration: Enable autoscaling local storage.
Explanation: During periods of high workload, additional disk space may be required if all data cannot be held in memory. Enabling this option ensures that sufficient disk resources are available, preventing failures caused by exceeding the default disk allocation.
For more information, refer to Autoscaling of Instance Storage.
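
In the create payload sketched above, autoscaling local storage maps to the enable_elastic_disk flag; a minimal fragment, assuming the rest of the request is unchanged:

payload = {
    "instance_pool_name": "dis-profiling-pool",
    "node_type_id": "r6id.xlarge",
    "idle_instance_autotermination_minutes": 2,
    "enable_elastic_disk": True,   # acquire additional disk space automatically when local storage runs low
}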

Instance type selection
Configuration: Prefer on-demand instances for reliability; spot instances can also be chosen depending on requirements. For jobs that prioritize cost savings over reliability, populate the pool with spot instances to reduce costs.
AWS: r6id.xlarge (4 vCPU, 32 GB RAM).
Azure: Use an equivalent VM type with similar CPU and RAM characteristics.
Explanation: Reliable for long-running profiling jobs; recommended based on internal baselines, which showed fewer failures. This configuration also provides extra disk space when workloads exceed memory and the default disk allocation, preventing job failures and ensuring reliable, uninterrupted processing of large datasets.
For more information, refer to Pool Considerations.
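
A minimal sketch of how the AWS options above can be expressed in the pool payload: the availability field under aws_attributes chooses between on-demand and spot capacity, and node_type_id carries the EC2 instance type (an Azure workspace would use an equivalent Azure VM node type instead). The pool names are placeholders.

# On-demand capacity: preferred for reliability in long-running profiling jobs.
on_demand_pool = {
    "instance_pool_name": "dis-profiling-ondemand",    # placeholder name
    "node_type_id": "r6id.xlarge",                     # 4 vCPU, 32 GB RAM
    "aws_attributes": {"availability": "ON_DEMAND"},
}

# Spot capacity: choose when cost savings matter more than reliability.
spot_pool = {
    "instance_pool_name": "dis-profiling-spot",
    "node_type_id": "r6id.xlarge",
    "aws_attributes": {"availability": "SPOT"},
}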

Min idle instances
Configuration: Set to 0 (default).
Explanation: Prevents unnecessary costs when no jobs are running.

Max capacity in Databricks pool
Configuration: Set according to quota constraints.
Explanation: The Max capacity setting in the Databricks pool configuration and the Max Nodes setting in the Data Integrity Suite pipeline engine together determine how many jobs or pipelines (i.e., tables) can be profiled in parallel. Any jobs exceeding this limit are queued.
Refer to Manage Databricks quota limitation for more details.
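
Both limits sit on the same pool payload; a fragment is shown below, with an illustrative max_capacity that you should replace with a value that fits your cloud quota and the pipeline engine's Max Nodes setting:

pool_limits = {
    "min_idle_instances": 0,   # default; no warm instances are kept when nothing is running
    "max_capacity": 10,        # illustrative cap; align with your cloud quota and Max Nodes
}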

Pipeline engine configuration
Configuration: Set the autoscale range (min and max nodes) based on the data size. For example, autoscale between 1 and 10 nodes for data up to 499 GB. Adjust according to quota constraints (e.g., >50 cores).
Explanation: Supports scaling based on data size and job concurrency; review the settings periodically as data volumes change.
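
For reference only, the same 1–10 range expressed as a Databricks cluster autoscale specification drawing nodes from the pool; in practice you set min and max nodes in the Data Integrity Suite pipeline engine configuration, and the instance_pool_id value below is a placeholder.

cluster_autoscale = {
    "autoscale": {
        "min_workers": 1,    # scale down to a single worker for small tables
        "max_workers": 10,   # upper bound for data volumes up to roughly 499 GB
    },
    "instance_pool_id": "<instance-pool-id>",   # placeholder for the pool created earlier
}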