|
Storage requirements
|
- Profiling large datasets requires temporary disk space for
the Spark engine.
- Allocate at least 1 TB of local storage on the Agent virtual
machine to absorb temporary data spills during profiling;
insufficient spill space can cause job failures during
execution.
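A pre-flight check along these lines can catch insufficient spill space before a job starts. This is a sketch, not a product feature; the scratch path and the 1 TiB threshold (mirroring the recommendation above) are assumptions you would adapt to your deployment.

```python
import shutil

def has_enough_scratch_space(path, required_gib=1024):
    """Return True if the volume holding `path` has at least
    `required_gib` GiB free for temporary spills during profiling."""
    free_gib = shutil.disk_usage(path).free / (1024 ** 3)
    return free_gib >= required_gib

# Example: verify the agent's scratch directory (path is assumed)
# before submitting a profiling job.
ok = has_enough_scratch_space(".", required_gib=1024)
```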
|
|
Database table optimization
|
- To enhance profiling performance, optimize the source
database tables before profiling them.
- Recommended table optimizations include:
- Primary Keys (PK)
- Unique Keys (UK)
- Appropriate Indexes
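The effect of an index is visible in the query plan. As a hypothetical illustration (using SQLite as a stand-in for the source database), the plan for a key lookup shifts from a full table scan to an index search once an index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")

# Without an index, a key lookup falls back to a full table scan.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE id = 42"
).fetchone()

# With a unique index on the key column, the same lookup becomes
# an index search.
conn.execute("CREATE UNIQUE INDEX idx_customers_id ON customers (id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE id = 42"
).fetchone()

# The last column of each plan row is a human-readable detail string,
# e.g. a "SCAN" before the index and a "SEARCH ... USING INDEX" after.
print(plan_before[-1])
print(plan_after[-1])
```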
|
|
Memory requirements
|
- The profiling engine demands significant memory,
particularly with large and wide datasets.
- Memory allocation should be adjusted based on the volume of
input tables to prevent out-of-memory errors.
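One way to reason about that adjustment is a rough sizing heuristic: scale the raw data size by an engine overhead factor and enforce a minimum floor. The factor and floor below are illustrative assumptions, not product defaults.

```python
def estimate_job_memory_gib(row_count, avg_row_bytes,
                            overhead_factor=3.0, floor_gib=8):
    """Rough sizing sketch: raw input size scaled by an assumed
    engine overhead factor, with a minimum allocation floor."""
    raw_gib = row_count * avg_row_bytes / (1024 ** 3)
    return max(floor_gib, raw_gib * overhead_factor)

# Example: 100 million rows averaging 200 bytes each.
suggested = estimate_job_memory_gib(100_000_000, 200)
```

Wide tables (many columns) push `avg_row_bytes` up quickly, which is why wide datasets are called out above as particularly memory-hungry.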
|
|
Parallel profiling jobs
|
- The capacity for running parallel profiling jobs is
contingent on the host machine's configuration (CPU and RAM)
and the memory allocated per profiling pipeline.
- For example, a machine with 64 GB RAM and 16 CPU cores can
run at most 2 profiling jobs in parallel, with each job
allocated 24 GB of memory.
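The capacity calculation above can be sketched as the minimum of a RAM bound and a CPU bound. The per-job core count and the OS memory reserve below are illustrative assumptions, not product settings.

```python
def max_parallel_jobs(total_ram_gib, total_cores, mem_per_job_gib,
                      cores_per_job=4, os_reserve_gib=8):
    """How many profiling jobs fit on the host: bounded by RAM
    (minus an assumed OS reserve) and by available CPU cores."""
    by_ram = (total_ram_gib - os_reserve_gib) // mem_per_job_gib
    by_cpu = total_cores // cores_per_job
    return max(0, int(min(by_ram, by_cpu)))

# The worked example above: 64 GB RAM, 16 cores, 24 GB per job.
print(max_parallel_jobs(64, 16, 24))  # 2
```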
|
|
Performance tuning with CPU
|
- Increasing the CPU allocation for the profiling pipeline
engine can significantly enhance profiling performance.
- Note: This improvement is most effective when the source
tables are equipped with Primary Keys (PK), Unique Keys
(UK), and indexes.
- For instance, assigning one CPU core allows the system to
execute a single task at a time, while two cores enable
parallel task execution, reducing the processing time.
However, the CPU cores allocated should not exceed the host
machine's actual cores. Overprovisioning cores can cause
resource contention, leading to unexpected behavior or
performance degradation during execution.
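The single-core versus multi-core effect can be captured with a simple idealized model: with n cores, tasks complete in waves of n. This is a back-of-the-envelope sketch that ignores scheduling overhead, data skew, and contention.

```python
import math

def estimated_wall_time(task_count, cores, seconds_per_task):
    """Idealized wall-time model: tasks run in ceil(tasks / cores)
    sequential waves, each taking one task duration."""
    return math.ceil(task_count / cores) * seconds_per_task

# 8 equal tasks of 60 s each:
print(estimated_wall_time(8, 1, 60))  # 480 s with one core
print(estimated_wall_time(8, 2, 60))  # 240 s with two cores
```

Note that the model assumes the extra cores actually exist on the host; as stated above, allocating more cores than are physically available yields contention rather than speedup.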
|