Sample data can be used to add, configure, and test steps in a Data Quality pipeline. It can be uploaded from a file or generated from a cataloged dataset. Use sample data when privacy constraints prevent the use of actual data to create a pipeline.
For both uploaded and generated samples, you may choose to include only those fields from the cataloged dataset that will be processed as input data by the pipeline. This can simplify assembling a pipeline for a large dataset with many fields.
You can upload sample data to ensure that a sample includes specific characteristics, such as duplicates or addressing errors. When these or other concerns do not apply, you can quickly generate sample data from the source data. By default, sample datasets are stored in encrypted format on the Precisely Cloud. Optionally, you can choose to store sample datasets in an Amazon S3 cloud storage location. Sample data stored in an S3 bucket is managed by AWS, and subscribers can configure their own encryption and deletion policies.
The procedures to create and run Data Quality pipelines are the same for both storage locations. Steps in a pipeline may create fields not included in the sample dataset. If a field created by a pipeline has the same name as a field already in the cataloged dataset, data in that field may overwrite data in the existing field.
Characteristics of a sample file to upload
- Contains representative data in a text (.txt) or comma-delimited (.csv) file with fields that correspond to some or all of the fields in a mapped schema.
- Can contain up to 10 MB of data.
- Does not have to contain all of the fields from the cataloged dataset.
- Sample columns may be in any order as long as the first row contains field names exactly as they appear in the cataloged dataset. Any column name in the file that does not match a column name in the cataloged dataset creates an error that prevents the upload.
- Uploaded sample data must be in the same format as the data type in the dataset. For example, depending on the data source, Date-Time fields are often in ISO format ('1969-07-16T20:17:00').
- Fields in a text file may be delimited by the comma (,), period (.), pipe (|), semicolon (;), space ( ), or tab character.
- Text in a field that contains the delimiter may be qualified by single or double quotation characters.
- Line breaks may use the Unix (or OS X) LF character, the Windows CR and LF characters, or the Macintosh CR character.
- The first row must contain delimited field names that match fields specified by the mapped schema.
- The delimited field names and data can be in any order.
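For illustration, a minimal sample file that follows these rules might look like the following. The field names (Name, Address, OrderDate) are hypothetical; in practice they must match field names in your cataloged dataset exactly:

```csv
Name,Address,OrderDate
"Smith, John",123 Main St,1969-07-16T20:17:00
"O'Brien, Mary",456 Oak Ave,1970-01-01T00:00:00
```

Note that the first row contains the field names, the comma delimiter is used consistently, fields containing the delimiter (such as "Smith, John") are qualified with double quotation marks, and Date-Time values use ISO format.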
Data format guidelines
These guidelines must be followed for data that you want to upload:
- Must be in the same format as the data type in the dataset. For example, a date-time field is usually in ISO format, such as DateTime ('2011-12-03T10:15:30').
- Sample data containing a geospatial column type must include the spatial features in GeoJSON format. In Snowflake, data can be exported to GeoJSON using either of the following methods:
  - Setting the session before exporting the data: `ALTER SESSION SET GEOGRAPHY_OUTPUT_FORMAT = 'GeoJSON'` (for more information, see the Snowflake documentation).
  - Using the `ST_ASGEOJSON` function in the SQL used to export the data: `SELECT ID, Description, ST_AsGeoJSON(Geom) FROM Mytable`.
- Both Snowflake GEOMETRY and GEOGRAPHY columns are supported. The geometry in the export must be in the original coordinate system of the table. Although the GeoJSON specification requires the geometry to be in the WGS84 coordinate system (SRID = 4326), the Snowflake `ST_ASGEOJSON` function and the Data Integrity Suite do not impose this requirement.
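As a sketch, the two Snowflake export methods described above could be written as follows. The table and column names (Mytable, ID, Description, Geom) are illustrative only:

```sql
-- Method 1: set the session output format, then select the geography column directly.
-- All GEOGRAPHY values in this session are then rendered as GeoJSON.
ALTER SESSION SET GEOGRAPHY_OUTPUT_FORMAT = 'GeoJSON';
SELECT ID, Description, Geom FROM Mytable;

-- Method 2: convert the column explicitly with ST_AsGeoJSON,
-- independent of the session-level output format.
SELECT ID, Description, ST_AsGeoJSON(Geom) AS Geom FROM Mytable;
```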