Google Dataproc

Use the Dataproc pipeline engine to run pipelines with data on Google Cloud Platform.

  • Pipeline engine name: Enter a meaningful name for the pipeline engine.
  • Type: Specifies Google Dataproc as the datasource type for this pipeline engine.
  • Datasource: Lists the available datasources for the selected datasource type.
  • Connection: Lists the available connections for the selected datasource.
  • Google Cloud Storage bucket: Specify the Cloud Storage bucket that the engine uses to store ephemeral cluster and job data.
Note: Both the Dataproc connection and the Dataproc cluster's service account must have read and write access to this bucket; the example command after this list shows one way to grant it.
  • Worker node count: Specify the number of worker nodes for processing pipelines.
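
For illustration, one way to grant that read/write access is an IAM binding on the bucket. This is a sketch with placeholder names: the bucket my-dis-staging, the project my-project, and the service account dataproc-sa@my-project.iam.gserviceaccount.com are assumptions, not values the Suite creates for you.

    # Grant the service account read/write access to the staging bucket
    gcloud storage buckets add-iam-policy-binding gs://my-dis-staging \
        --member="serviceAccount:dataproc-sa@my-project.iam.gserviceaccount.com" \
        --role="roles/storage.objectAdmin"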

Advanced Settings

  • Auto-scaling Policy: Specifies the name of the auto-scaling policy to use for automating cluster resource management; a sample policy definition follows this list. For more information, see enable autoscaling.
  • Service account: Specifies the account that services and applications running on a Compute Engine virtual machine (VM) instance use to interact with other Google Cloud APIs. If a service account is specified, the Dataproc cluster uses that custom service account for Dataproc data handling operations instead of the default VM service account. For more information, see service accounts.
Note: Whether you use the default or a custom service account, it must have the required data access permissions for the input and output datasets used in the pipelines. A sample role binding for a custom account follows this list.
  • Spark properties: Specify Spark properties that determine the compute, memory, and disk resources to allocate to Dataproc batch workloads. Click Add Property to add a key-value pair; sample pairs follow this list.
    • Key: Specifies the Spark property, such as spark.driver.cores.
    • Value: Specifies a valid value for the property, such as 8. For more information, see Spark Properties in the Dataproc Serverless documentation.
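
As a sketch of what an auto-scaling policy definition can look like, the YAML below scales primary workers between two and ten instances and is imported with the gcloud CLI. The policy ID my-scaling-policy, the region, and all numeric values are illustrative assumptions, not Suite defaults.

    # policy.yaml: instance bounds and YARN-based scaling behavior
    workerConfig:
      minInstances: 2
      maxInstances: 10
    basicAlgorithm:
      cooldownPeriod: 4m
      yarnConfig:
        scaleUpFactor: 0.5
        scaleDownFactor: 1.0
        gracefulDecommissionTimeout: 1h

    # Import the policy, then enter its ID in the Auto-scaling Policy field
    gcloud dataproc autoscaling-policies import my-scaling-policy \
        --source=policy.yaml --region=us-central1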
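A custom service account typically also needs permission to run Dataproc workloads, in addition to data access. The binding below is a sketch under that assumption; the project my-project and account dataproc-sa are placeholders, and your datasets may require further roles.

    # Allow the custom service account to act as a Dataproc worker
    gcloud projects add-iam-policy-binding my-project \
        --member="serviceAccount:dataproc-sa@my-project.iam.gserviceaccount.com" \
        --role="roles/dataproc.worker"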
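For example, the following key-value pairs size the driver and executors of a batch workload. The values are illustrative only; valid values and limits are defined in the Dataproc Serverless documentation.

    Key                        Value
    spark.driver.cores         8
    spark.driver.memory        16g
    spark.executor.cores       8
    spark.executor.memory      16g
    spark.executor.instances   4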