Profiling guidelines for agent-based connections

Data Integrity Suite

When you use an agent-based connection to profile large datasets, follow these best practices and system requirements to ensure optimal performance and stability.

Category Details
Storage requirements
  • Profiling large datasets requires temporary disk usage by the Spark engine.
  • Allocate at least 1 TB of local storage on the Agent virtual machine to absorb temporary data spills during profiling and prevent job failures.
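
A pre-flight check along these lines could confirm that the Agent virtual machine has enough room for temporary spills before a profiling job starts. This is a minimal sketch, not part of the product: the function name and the default path are hypothetical, the actual spill directory used by the Spark engine is not stated here, and 1 TB is assumed to mean a decimal terabyte.

```python
import shutil

# Assumption: "1 TB" is read as a decimal terabyte (10^12 bytes).
ONE_TB_BYTES = 1000 ** 4


def has_enough_spill_space(path: str = ".", required_bytes: int = ONE_TB_BYTES) -> bool:
    """Return True if the filesystem that holds `path` has at least
    `required_bytes` of free space.

    `path` is a placeholder; point it at the directory where the
    profiling engine actually writes its temporary data.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= required_bytes


if __name__ == "__main__":
    if has_enough_spill_space("."):
        print("Enough free local storage for profiling spills.")
    else:
        print("Less than 1 TB free; large profiling jobs may fail.")
```

The check uses free space rather than total capacity, on the assumption that other data on the same volume reduces what is available for spills.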
Database table optimization
  • To enhance profiling performance, source database tables should be well-optimized.
  • Recommended table optimizations include:
    • Primary Keys (PK)
    • Unique Keys (UK)
    • Appropriate Indexes
Memory requirements
  • The profiling engine demands significant memory, particularly with large and wide datasets.
  • Memory allocation should be adjusted based on the volume of input tables to prevent out-of-memory errors.
Parallel profiling jobs
  • The capacity for running parallel profiling jobs is contingent on the host machine's configuration (CPU and RAM) and the memory allocated per profiling pipeline.
  • For example, a machine with 64 GB RAM and 16 CPU cores can run at most 2 profiling jobs in parallel when each job is allocated 24 GB of memory.
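
The arithmetic behind the 64 GB example can be sketched as follows. This is a memory-only estimate under stated assumptions: the function name and the `reserved_gb` parameter (memory held back for the operating system and the Agent itself) are hypothetical, and CPU limits are not modeled because no per-job core figure is given.

```python
def max_parallel_jobs(host_ram_gb: float,
                      memory_per_job_gb: float,
                      reserved_gb: float = 0.0) -> int:
    """Estimate how many profiling jobs fit in host memory at once.

    reserved_gb is a hypothetical allowance for the OS and other
    processes; it is not a documented setting.
    """
    if memory_per_job_gb <= 0:
        raise ValueError("memory_per_job_gb must be positive")
    usable_gb = max(0.0, host_ram_gb - reserved_gb)
    # Only whole jobs count, so round down.
    return int(usable_gb // memory_per_job_gb)


if __name__ == "__main__":
    # 64 GB host, 24 GB per job: two jobs use 48 GB, a third would need 72 GB.
    print(max_parallel_jobs(64, 24))
```

With 64 GB of RAM and 24 GB per job, two jobs consume 48 GB and leave 16 GB of headroom, while a third job would exceed the host's memory.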
Performance tuning with CPU
  • Increasing the CPU allocation for the profiling pipeline engine can significantly enhance profiling performance.
  • Note: This improvement is most effective when the source tables are equipped with Primary Keys (PK), Unique Keys (UK), and indexes.
  • For instance, one CPU core lets the system execute a single task at a time, while two cores allow tasks to run in parallel, reducing processing time. However, the number of CPU cores allocated should match the cores actually available on the host machine. Over-provisioning (assigning more cores than are available) can cause resource contention, leading to unexpected behavior or performance issues during execution.
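
One way to guard against over-provisioning is to cap the requested core count at what the host reports. This is an illustrative sketch only: `clamp_cores` is a hypothetical helper, not a product setting, and `os.cpu_count()` reports logical cores, which may differ from the cores available to the Agent in a container or virtual machine.

```python
import os
from typing import Optional


def clamp_cores(requested: int, available: Optional[int] = None) -> int:
    """Return a core count that never exceeds the host's cores.

    If `available` is not given, fall back to os.cpu_count(), which can
    return None on some platforms; in that case assume a single core.
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    if available is None:
        available = os.cpu_count() or 1
    return min(requested, available)


if __name__ == "__main__":
    # Asking for 32 cores on a 16-core host is capped at 16.
    print(clamp_cores(32, available=16))
```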