Data Modeling Data Warehousing Data Quality Oracle Database Erwin BI FAQ: May 2011

Introduction

This document contains the Data Stage Best practises and recommendations which could be used to improve the quality of data stage jobs. This document would further be enhanced to include the specific Data Stage problems and there troubleshooting.

Recommendations

• Data Stage Version
• DataStage Job Naming Conventions
• DataStage Job Stage and Link Naming Conventions
• DataStage Job Descriptions
• DataStage Job Complexity
• DataStage Job Design
• Error and/or Reject Handling
• Process Exception Handling
• Standards
• Development Guidelines
• Component Usage
• DataStage Data Types
• Partitinong Data
• Collecting Data
• Sorting
• Stage Specific Guidelines

Data Stage Version

Use DataStage Enterprise edition v 8.1 over Server edition for any future production development. Enterprise edition has a more robust set of Job stages, provides better performance through parallel processing, and more future flexibility through its scalability.

DataStage Job Naming Conventions

For DataStage job development, a standard naming convention should be utilized for all job names. This could include various components including the type of job, source of data, type of data, category, etc.
Eg: jbCd_Aerodrome_Type_Code_RDS
jbDtl_Prty_Address_CAAMS

The job naming conventions for the conversion jobs don’t really need to change at this point. These jobs are likely executed only during the initial conversion and not used again after that. If any of these are to become part of the production process, then changing the job name to a standard format would be preferred.

DataStage Job Stage and Link Naming Conventions

It is recommended that a standard naming convention be adopted and adhered to for all jobs and sequences. Although the stated items above are only minor variations and oversights, there should be consistency and completeness in naming stages and links.

Again, the conversion jobs are only to be executed once and addressing any inconsistencies does not make sense at this point. Future development should adhere to the defined standard.

DataStage Job Descriptions

Descriptions should be included in every DataStage job and Job stage. This is facilitated for the Job in the Job Properties window, allowing both a Short and Long Job Descriptions. It is also found in the Stage Properties for each stage. Descriptions allow other developers, and those reviewing the code, to better understand the purpose of the job, how it accomplishes it, and any special processing. The more complex the Job or Stage is, the more detail that should be included. Even simple self explanatory Jobs or Stages require some sort of description.

DataStage Job Complexity

Production jobs on the other hand shouldn’t be this complex. They should complete a specific task with a minimal number of Stages. Typically data processing is broken up into Extraction, Staging, Validation, Transformation, Load Ready, and Load jobs. Each job, of each category, typically deals with one source or target table at a time, with DataSets used to pass data between jobs. The end result is that many more DataStage jobs are required to complete the same process,

DataStage Job Design

Continue this design approach for any new development where there are similarities between jobs. It is always quicker to develop a new job if a similar job can be leveraged as a starting point. In addition, there is an opportunity to create Shared Containers with common code that can be reused across a number of jobs. This simplifies the development of each similar job, and only requires changes/maintenance of one version of the common code (Shared Container). Any new development should consider job designs that allow Shared Containers to utilized for common coding elements.

Error Handling

Implement error handling to manage records that cannot be processed for various reasons. This includes records with bad data, missing data (not null attributes), orphaned child records, missing code table entries, and other business rules required for excluding specific data or complete UOW (units of work). A reject process should also be considered if records are to be reprocessed at later date, such as when code tables get updated, or when the missing Parent records are finally processed. The staging area can be used to maintain record status so that successful and failed records can be tracked.

Process Exception Handling

Exception handling should be implemented in any production sequence job called by another sequence. Since the conversion process is likely a manual process, any failure could be dealt with manually as required.

On the other hand, in a production environment, dependant job sequences should not be executed if their predecessor job sequences do not complete successfully. The called sequence should include the Exception Handler and Terminator stages to prevent further processing when a job fails. This allows the problem to be addressed and the sequence restarted with fewer issues.

Standards

It is important to establish and follow consistent standards in:

Directory structures for installation and application support directories

Naming conventions, especially for DataStage Project categories, stage names, and links.

All DataStage jobs should be documented with the Short Description field, as well as Annotation fields.

It is the DataStage developer’s responsibility to make personal backups of their work on their local workstation, using DataStage's DSX export capability. This can also be used for integration with source code control systems.

Development Guidelines

Modular development techniques should be used to maximize re-use of DataStage jobs and components:

Job parameterization allows a single job design to process similar logic instead of creating multiple copies of the same job. The Multiple-Instance job property allows multiple invocations of the same job to run simultaneously.

_ A set of standard job parameters should be used in DataStage jobs for source and target database parameters (DSN, user, password, etc.) and directories where files are stored. To ease re-use, these standard parameters and settings should be made part of a Designer Job Parameter Sets.
_ Create a standard directory structure outside of the DataStage project directory for source and target files, intermediate work files, and so forth.
_ Where possible, create re-usable components such as parallel shared containers to encapsulate frequently-used logic.
_ DataStage Template jobs should be created
– Standard parameters such as source and target file paths, and database login properties
– Environment variables and their default settings
– Annotation blocks
_ Job Parameters should always be used for file paths, file names, database login settings.
_ Standardized Error Handling routines should be followed to capture errors and rejects.

Component Usage

The following guidelines should be followed when constructing parallel jobs in IBM InfoSphere DataStage Enterprise Edition:
_ Never use Server Edition components (BASIC Transformer, Server Shared Containers) within a parallel job. BASIC Routines are appropriate only for job control sequences.
_ Always use parallel Data Sets for intermediate storage between jobs unless that specific data also needs to be shared with other applications.
_ Use the Copy stage as a placeholder for iterative design, and to facilitate default type conversions.
_ Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch stages.
_ Use BuildOp stages only when logic cannot be implemented in the parallel Transformer.

DataStage Datatypes

The following guidelines should be followed with DataStage data types:
_ Be aware of the mapping between DataStage (SQL) data types and the internal DS/EE data types. If possible, import table definitions for source databases using the Orchestrate Schema Importer (orchdbutil) utility.

_ Leverage default type conversions using the Copy stage or across the Output mapping tab of other stages.

Partitioning Data

In most cases, the default partitioning method (Auto) is appropriate. With Auto partitioning, the Information Server Engine will choose the type of partitioning at runtime based on stage requirements, degree of parallelism, and source and target systems. While Auto partitioning will generally give correct results, it might not give optimized performance. As the job developer, you have visibility into requirements, and can optimize within a job and across job flows. Given the numerous options for keyless and keyed partitioning, the following objectives form a methodology for assigning partitioning:

_ Objective 1
Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead. This ensures that the processing workload is evenly balanced, minimizing overall run time.

_ Objective 2
The partition method must match the business requirements and stage functional requirements, assigning related records to the same partition if required.

Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partition method.

This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It might also be necessary for Transformers and BuildOps that process groups of related records.

_ Objective 3
Unless partition distribution is highly skewed, minimize re-partitioning, especially in cluster or Grid configurations.
Re-partitioning data in a cluster or Grid configuration incurs the overhead of network transport.

_ Objective 4
Partition method should not be overly complex. The simplest method that meets the above objectives will generally be the most efficient and yield the best performance.

Using the above objectives as a guide, the following methodology can be applied:

a. Start with Auto partitioning (the default).

b. Specify Hash partitioning for stages that require groups of related records
as follows:

• Specify only the key column(s) that are necessary for correct grouping as long as the number of unique values is sufficient
• Use Modulus partitioning if the grouping is on a single integer key column
• Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (Range Map can be reused)
c. If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions.
• Especially useful if the input Data Set is highly skewed or sequential d. Use Same partitioning to optimize end-to-end partitioning and to minimize re-partitioning
• Be mindful that Same partitioning retains the degree of parallelism of the upstream stage
• Within a flow, examine up-stream partitioning and sort order and attempt to preserve for down-stream processing. This may require re-examining key column usage within stages and re-ordering stages within a flow (if business requirements permit).

Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is particularly useful if downstream jobs are run with the same degree of parallelism (configuration file) and require the same partition and sort order.

Collecting Data

Given the options for collecting data into a sequential stream, the following guidelines form a methodology for choosing the appropriate collector type:

1. When output order does not matter, use Auto partitioning (the default).

2. Consider how the input Data Set has been sorted:

– When the input Data Set has been sorted in parallel, use Sort Merge collector to produce a single, globally sorted stream of rows.
– When the input Data Set has been sorted in parallel and Range partitioned, the Ordered collector might be more efficient.
3. Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input Data Sets, as long as the Data Set has not been re-partitioned or reduced.

Sorting Data

Apply the following methodology when sorting in an IBM InfoSphere DataStage Enterprise Edition data flow:
1. Start with a link sort.
2. Specify only necessary key column(s).
3. Do not use Stable Sort unless needed.
4. Use a stand-alone Sort stage instead of a Link sort for options that are not available on a Link sort:
– The “Restrict Memory Usage” option should be included here. If you want more memory available for the sort, you can only set that via the Sort Stage — not on a sort link. The environment variable
$APT_TSORT_STRESS_BLOCKSIZE can also be used to set sort memory usage (in MB) per partition.
– Sort Key Mode, Create Cluster Key Change Column, Create Key Change Column, Output Statistics.
– Always specify “DataStage” Sort Utility for standalone Sort stages.
– Use the “Sort Key Mode=Don’t Sort (Previously Sorted)” to resort a sub-grouping of a previously-sorted input Data Set.

5. Be aware of automatically-inserted sorts:
– Set $APT_SORT_INSERTION_CHECK_ONLY to verify but not establish
required sort order.
6. Minimize the use of sorts within a job flow.
7. To generate a single, sequential ordered result set, use a parallel Sort and a
Sort Merge collector.

Stage Specific Guidelines

Transformer
Take precautions when using expressions or derivations on nullable columns within the parallel Transformer:
– Always convert nullable columns to in-band values before using them in an expression or derivation.
– Always place a reject link on a parallel Transformer to capture / audit possible rejects.

Lookup
It is most appropriate when reference data is small enough to fit into available shared memory. If the Data Sets are larger than available memory resources, use the Join or Merge stage.
Limit the use of database Sparse Lookups to scenarios where the number of input rows is significantly smaller (for example 1:100 or more) than the number of reference rows, or when exception processing.

Join
Be particularly careful to observe the nullability properties for input links to any form of Outer Join. Even if the source data is not nullable, the non-key columns must be defined as nullable in the Join stage input in order to identify unmatched records.

Aggregators
Use Hash method Aggregators only when the number of distinct key column values is small. A Sort method Aggregator should be used when the number of distinct key values is large or unknown.

Database Stages
The following guidelines apply to database stages:
– Where possible, use the Connector stages or native parallel database stages for maximum performance and scalability.
– The ODBC Connector and ODBC Enterprise stages should only be used when a native parallel stage is not available for the given source or target database.
– When using Oracle, DB2, or Informix databases, use Orchestrate Schema Importer (orchdbutil) to properly import design metadata.
– Take care to observe the data type mappings.

Data Modeling Data Warehousing Data Quality Oracle Database Erwin BI FAQ

Monday, 30 May 2011

MDM : What is Master Data ?

Sunday, 29 May 2011

DataStage huge logs removal resolution

Thursday, 26 May 2011

Datastage Best Practices

Blog Archive

About Me