Use the Dataproc pipeline engine to run pipelines with data on Google Cloud Platform.
- Pipeline engine name: Enter a meaningful name for the pipeline engine.
- Type: Specifies Google Dataproc as the datasource type for this pipeline engine.
- Datasource: Lists the available datasources for the selected datasource type.
- Connection: Lists the available connections for the selected datasource.
- Google Cloud Storage bucket: Specify the Cloud Storage bucket that the engine uses to store ephemeral cluster and job data.
Note: The Dataproc connection and the Dataproc cluster's service account must have read and write access to this bucket (see the sketch after this list).
- Worker node count: Specify the number of worker nodes for processing pipelines.
- Worker node type: Enter the machine type for the worker nodes. For more information, see supported machine types.
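As a minimal sketch of satisfying the bucket access requirement noted above, the following uses the google-cloud-storage Python client to grant a service account read and write access on the staging bucket. The project ID, bucket name, service account, and role choice (roles/storage.objectAdmin) are placeholder assumptions; you can grant equivalent access through the Google Cloud console instead.

```python
from google.cloud import storage

# Placeholder values; substitute your own project, bucket, and the
# service account used by the Dataproc connection and cluster.
PROJECT_ID = "my-project"
BUCKET_NAME = "my-pipeline-staging-bucket"
SERVICE_ACCOUNT = "dataproc-sa@my-project.iam.gserviceaccount.com"

client = storage.Client(project=PROJECT_ID)
bucket = client.bucket(BUCKET_NAME)

# Fetch the bucket's IAM policy and add object read/write permissions
# (roles/storage.objectAdmin) for the service account.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectAdmin",
        "members": {f"serviceAccount:{SERVICE_ACCOUNT}"},
    }
)
bucket.set_iam_policy(policy)
```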
Advanced Settings
- Auto-scaling policy: Specifies the name of the auto-scaling policy that you want to use for automating cluster resource management. For more information, see enable autoscaling.
- Service account: Specify a custom service account for the Dataproc cluster. If you do not specify one, the default service account is used.
Note: For either the default or custom service account, the service account must have the required data access permissions for the input and output datasets used in the pipelines.
- Spark properties: Specify Spark properties that determine the compute, memory, and disk resources allocated to Dataproc batch workloads (see the sketch after this list).
- Click Add Property to add a key-value pair.
- Key: Specifies the Spark property, such as spark.driver.cores.
- Value: Specifies a valid value for the property, such as 8. For more information, see Spark Properties in the Dataproc Serverless documentation.
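For illustration only, the same key-value pairs can also be supplied when creating a Dataproc Serverless batch with the google-cloud-dataproc Python client. This is a minimal sketch of how such properties map onto a batch request, not how the pipeline engine itself submits work; the project, region, jar URI, and main class are placeholder assumptions.

```python
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"   # placeholder project ID
REGION = "us-central1"      # placeholder region

# Dataproc Serverless batches require the regional endpoint.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    spark_batch=dataproc_v1.SparkBatch(
        main_class="org.example.PipelineJob",           # placeholder job class
        jar_file_uris=["gs://my-bucket/pipeline.jar"],  # placeholder jar
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        # Key-value pairs corresponding to the engine's Spark properties setting.
        properties={
            "spark.driver.cores": "8",
            "spark.executor.memory": "8g",
        }
    ),
)

operation = client.create_batch(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}",
    batch=batch,
)
print(operation.result())  # waits for the batch to complete
```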