This section outlines recommended strategies and settings for optimizing Databricks pool configurations, focusing on cost efficiency, scalability, and reliable job execution in data profiling and processing scenarios.
| Scenario | Configuration | Explanation |
|---|---|---|
| Idle instance auto termination | Set to 2 minutes | Ensures idle clusters terminate quickly, minimizing compute costs. |
| Auto scaling local storage | Enable auto-scaling local storage | During periods of high workload, additional disk space may be required if not all data can be retained in memory. Enabling this option ensures that sufficient disk resources are available, preventing failures caused by exceeding the default allocation. For more information, refer to Autoscaling of Instance Storage. |
| Instance type selection | Prefer on-demand instances for better reliability. For jobs that prioritize cost savings over reliability, populate pools with spot instances to reduce costs. AWS instance type: r6id.xlarge (4 vCPU, 32 GB RAM). Azure: use an equivalent VM type with similar CPU/RAM characteristics. | Reliable for long-running profiling jobs; recommended based on internal baselines showing reduced failures. This option provides extra disk space when workloads exceed memory and the default disk allocation, preventing job failures. It is ideal for long-running profiling jobs and large datasets, ensuring reliable performance and uninterrupted processing. For more information, refer to Pool Considerations. |
| Min Idle instances | Set to 0 (default) | Prevents unnecessary costs when no jobs are running. |
| Max capacity in Databricks pool | Set according to quota constraints. The "Max capacity" in the Databricks pool configuration and "Max Nodes" in the Data Integrity Suite pipeline engine together determine the number of parallel jobs or pipelines (i.e., tables) that can be profiled in parallel. Any jobs exceeding the limit are queued. | Determines parallel job/pipeline capacity; excess jobs are queued. Refer to Manage Databricks quota limitation for more details. |
| Pipeline engine configuration | Set autoscaling (min and max nodes) based on the data size, for example 1–10 nodes for data up to 499 GB. Adjust according to quota constraints (e.g., >50 cores). | Supports scaling based on data size and job concurrency; review periodically as data volumes change. |
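The table's pool settings map onto the fields of the Databricks Instance Pools API (`POST /api/2.0/instance-pools/create`). The sketch below is a minimal illustration, not a definitive configuration: the pool name, workspace host, pool ID, and Spark runtime version are placeholders, and the AWS `r6id.xlarge` node type from the table is assumed (substitute the equivalent VM type on Azure).

```python
import json

# Minimal sketch of an instance-pool payload reflecting the settings above.
# Names marked "placeholder" are illustrative, not prescribed values.
pool_config = {
    "instance_pool_name": "profiling-pool",      # placeholder name
    "node_type_id": "r6id.xlarge",               # AWS: 4 vCPU, 32 GB RAM
    "idle_instance_autotermination_minutes": 2,  # terminate idle instances quickly
    "min_idle_instances": 0,                     # default; no cost when no jobs run
    "max_capacity": 10,                          # set according to quota constraints
    "enable_elastic_disk": True,                 # auto-scaling local storage
    "aws_attributes": {
        # ON_DEMAND for reliability; spot availability can reduce costs
        # for jobs that tolerate interruption.
        "availability": "ON_DEMAND",
    },
}

# Sketch of a cluster spec that draws nodes from the pool, with the
# 1-10 node autoscale range suggested for data up to 499 GB.
cluster_spec = {
    "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
    "instance_pool_id": "<pool-id>",      # placeholder; returned by pool creation
    "autoscale": {"min_workers": 1, "max_workers": 10},
}

print(json.dumps(pool_config, indent=2))
```

The payload can then be submitted to the workspace, for example with `requests.post(f"{host}/api/2.0/instance-pools/create", headers={"Authorization": f"Bearer {token}"}, json=pool_config)`, where `host` and `token` are your workspace URL and access token.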