Introduction
This document contains DataStage best practices and recommendations that can be used to improve the quality of DataStage jobs. It will be enhanced over time to include specific DataStage problems and their troubleshooting.
Recommendations
• Data Stage Version
• DataStage Job Naming Conventions
• DataStage Job Stage and Link Naming Conventions
• DataStage Job Descriptions
• DataStage Job Complexity
• DataStage Job Design
• Error and/or Reject Handling
• Process Exception Handling
• Standards
• Development Guidelines
• Component Usage
• DataStage Data Types
• Partitioning Data
• Collecting Data
• Sorting
• Stage Specific Guidelines
Data Stage Version
Use DataStage Enterprise Edition v8.1 rather than Server Edition for any future production development. Enterprise Edition has a more robust set of job stages, provides better performance through parallel processing, and offers more flexibility for the future through its scalability.
DataStage Job Naming Conventions
For DataStage job development, a standard naming convention should be utilized for all job names. This could include various components including the type of job, source of data, type of data, category, etc.
E.g.: jbCd_Aerodrome_Type_Code_RDS
jbDtl_Prty_Address_CAAMS
The job naming conventions for the conversion jobs don’t really need to change at this point. These jobs are likely executed only during the initial conversion and not used again after that. If any of these are to become part of the production process, then changing the job name to a standard format would be preferred.
DataStage Job Stage and Link Naming Conventions
It is recommended that a standard naming convention be adopted and adhered to for all jobs and sequences. Although the stated items above are only minor variations and oversights, there should be consistency and completeness in naming stages and links.
Again, the conversion jobs are only to be executed once and addressing any inconsistencies does not make sense at this point. Future development should adhere to the defined standard.
DataStage Job Descriptions
Descriptions should be included in every DataStage job and job stage. For the job, this is facilitated in the Job Properties window, which allows both Short and Long Job Descriptions; for each stage it is found in the Stage Properties. Descriptions allow other developers, and those reviewing the code, to better understand the purpose of the job, how it accomplishes it, and any special processing. The more complex the job or stage, the more detail should be included. Even simple, self-explanatory jobs or stages require some sort of description.
DataStage Job Complexity
Production jobs, on the other hand, should not be this complex. They should complete a specific task with a minimal number of stages. Typically, data processing is broken up into Extraction, Staging, Validation, Transformation, Load Ready, and Load jobs. Each job in each category typically deals with one source or target table at a time, with DataSets used to pass data between jobs. The end result is that many more DataStage jobs are required to complete the same process.
DataStage Job Design
Continue this design approach for any new development where there are similarities between jobs. It is always quicker to develop a new job if a similar job can be leveraged as a starting point. In addition, there is an opportunity to create Shared Containers with common code that can be reused across a number of jobs. This simplifies the development of each similar job and requires changes/maintenance of only one version of the common code (the Shared Container). Any new development should consider job designs that allow Shared Containers to be utilized for common coding elements.
Error Handling
Implement error handling to manage records that cannot be processed for various reasons. This includes records with bad data, missing data (for not-null attributes), orphaned child records, missing code table entries, and other business rules that require excluding specific records or complete UOWs (units of work). A reject process should also be considered if records are to be reprocessed at a later date, such as when code tables get updated or when the missing parent records are finally processed. The staging area can be used to maintain record status so that successful and failed records can be tracked; a small sketch of such a table follows.
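As an illustration only, a staging table that tracks record status might look like the following sketch; the table and column names are hypothetical and not taken from any existing standard:

CREATE TABLE STG_PRTY_ADDRESS
(STG_RECORD_ID NUMBER NOT NULL,              -- surrogate key for the staged row
SOURCE_SYSTEM VARCHAR2(30),                  -- originating system, e.g. CAAMS
LOAD_BATCH_ID NUMBER,                        -- identifies the load run / unit of work
PROCESS_STATUS VARCHAR2(10) DEFAULT 'NEW',   -- e.g. NEW, LOADED, REJECTED
REJECT_REASON VARCHAR2(200),                 -- populated when the row is rejected
LOAD_TIMESTAMP DATE DEFAULT SYSDATE,
CONSTRAINT STG_PRTY_ADDRESS_PK PRIMARY KEY (STG_RECORD_ID));

Rejected rows keep their reason and can be picked up again by a later run once the missing code table entries or parent records arrive.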
Process Exception Handling
Exception handling should be implemented in any production sequence job called by another sequence. Since the conversion process is likely a manual process, any failure could be dealt with manually as required.
On the other hand, in a production environment, dependent job sequences should not be executed if their predecessor job sequences do not complete successfully. The called sequence should include the Exception Handler and Terminator stages to prevent further processing when a job fails. This allows the problem to be addressed and the sequence restarted with fewer issues.
Standards
It is important to establish and follow consistent standards in:
• Directory structures for installation and application support directories
• Naming conventions, especially for DataStage Project categories, stage names, and links
All DataStage jobs should be documented with the Short Description field, as well as Annotation fields.
It is the DataStage developer’s responsibility to make personal backups of their work on their local workstation, using DataStage's DSX export capability. This can also be used for integration with source code control systems.
Development Guidelines
Modular development techniques should be used to maximize re-use of DataStage jobs and components:
_ Job parameterization allows a single job design to process similar logic instead of creating multiple copies of the same job. The Multiple-Instance job property allows multiple invocations of the same job to run simultaneously.
_ A set of standard job parameters should be used in DataStage jobs for source and target database parameters (DSN, user, password, etc.) and directories where files are stored. To ease re-use, these standard parameters and settings should be made part of a Designer Job Parameter Set.
_ Create a standard directory structure outside of the DataStage project directory for source and target files, intermediate work files, and so forth.
_ Where possible, create re-usable components such as parallel shared containers to encapsulate frequently-used logic.
_ DataStage template jobs should be created that contain:
– Standard parameters such as source and target file paths, and database login properties
– Environment variables and their default settings
– Annotation blocks
_ Job Parameters should always be used for file paths, file names, and database login settings.
_ Standardized Error Handling routines should be followed to capture errors and rejects.
Component Usage
The following guidelines should be followed when constructing parallel jobs in IBM InfoSphere DataStage Enterprise Edition:
_ Never use Server Edition components (BASIC Transformer, Server Shared Containers) within a parallel job. BASIC Routines are appropriate only for job control sequences.
_ Always use parallel Data Sets for intermediate storage between jobs unless that specific data also needs to be shared with other applications.
_ Use the Copy stage as a placeholder for iterative design, and to facilitate default type conversions.
_ Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch stages.
_ Use BuildOp stages only when logic cannot be implemented in the parallel Transformer.
DataStage Data Types
The following guidelines should be followed with DataStage data types:
_ Be aware of the mapping between DataStage (SQL) data types and the internal DS/EE data types. If possible, import table definitions for source databases using the Orchestrate Schema Importer (orchdbutil) utility.
_ Leverage default type conversions using the Copy stage or across the Output mapping tab of other stages.
Partitioning Data
In most cases, the default partitioning method (Auto) is appropriate. With Auto partitioning, the Information Server Engine will choose the type of partitioning at runtime based on stage requirements, degree of parallelism, and source and target systems. While Auto partitioning will generally give correct results, it might not give optimized performance. As the job developer, you have visibility into requirements, and can optimize within a job and across job flows. Given the numerous options for keyless and keyed partitioning, the following objectives form a methodology for assigning partitioning:
_ Objective 1
Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead. This ensures that the processing workload is evenly balanced, minimizing overall run time.
_ Objective 2
The partition method must match the business requirements and stage functional requirements, assigning related records to the same partition if required.
Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partition method.
This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It might also be necessary for Transformers and BuildOps that process groups of related records.
_ Objective 3
Unless partition distribution is highly skewed, minimize re-partitioning, especially in cluster or Grid configurations.
Re-partitioning data in a cluster or Grid configuration incurs the overhead of network transport.
_ Objective 4
Partition method should not be overly complex. The simplest method that meets the above objectives will generally be the most efficient and yield the best performance.
Using the above objectives as a guide, the following methodology can be applied:
a. Start with Auto partitioning (the default).
b. Specify Hash partitioning for stages that require groups of related records, as follows:
• Specify only the key column(s) that are necessary for correct grouping as long as the number of unique values is sufficient
• Use Modulus partitioning if the grouping is on a single integer key column
• Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (Range Map can be reused)
c. If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions.
• Especially useful if the input Data Set is highly skewed or sequential
d. Use Same partitioning to optimize end-to-end partitioning and to minimize re-partitioning
• Be mindful that Same partitioning retains the degree of parallelism of the upstream stage
• Within a flow, examine up-stream partitioning and sort order and attempt to preserve for down-stream processing. This may require re-examining key column usage within stages and re-ordering stages within a flow (if business requirements permit).
Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is particularly useful if downstream jobs are run with the same degree of parallelism (configuration file) and require the same partition and sort order.
Collecting Data
Given the options for collecting data into a sequential stream, the following guidelines form a methodology for choosing the appropriate collector type:
1. When output order does not matter, use the Auto collector (the default).
2. Consider how the input Data Set has been sorted:
– When the input Data Set has been sorted in parallel, use Sort Merge collector to produce a single, globally sorted stream of rows.
– When the input Data Set has been sorted in parallel and Range partitioned, the Ordered collector might be more efficient.
3. Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input Data Sets, as long as the Data Set has not been re-partitioned or reduced.
Sorting Data
Apply the following methodology when sorting in an IBM InfoSphere DataStage Enterprise Edition data flow:
1. Start with a link sort.
2. Specify only necessary key column(s).
3. Do not use Stable Sort unless needed.
4. Use a stand-alone Sort stage instead of a Link sort for options that are not available on a Link sort:
– The “Restrict Memory Usage” option is one example: if you want more memory available for the sort, you can only set that via the Sort stage, not on a link sort. The environment variable $APT_TSORT_STRESS_BLOCKSIZE can also be used to set sort memory usage (in MB) per partition.
– Sort Key Mode, Create Cluster Key Change Column, Create Key Change Column, Output Statistics.
– Always specify “DataStage” Sort Utility for standalone Sort stages.
– Use the “Sort Key Mode=Don’t Sort (Previously Sorted)” to resort a sub-grouping of a previously-sorted input Data Set.
5. Be aware of automatically-inserted sorts:
– Set $APT_SORT_INSERTION_CHECK_ONLY to verify, but not establish, the required sort order.
6. Minimize the use of sorts within a job flow.
7. To generate a single, sequentially ordered result set, use a parallel Sort and a Sort Merge collector.
Stage Specific Guidelines
Transformer
Take precautions when using expressions or derivations on nullable columns within the parallel Transformer:
– Always convert nullable columns to in-band values before using them in an expression or derivation.
– Always place a reject link on a parallel Transformer to capture / audit possible rejects.
Lookup
The Lookup stage is most appropriate when the reference data is small enough to fit into available shared memory. If the reference Data Sets are larger than available memory resources, use the Join or Merge stage instead.
Limit the use of database Sparse Lookups to scenarios where the number of input rows is significantly smaller (for example, 1:100 or more) than the number of reference rows, or when exception processing is required.
Join
Be particularly careful to observe the nullability properties for input links to any form of Outer Join. Even if the source data is not nullable, the non-key columns must be defined as nullable in the Join stage input in order to identify unmatched records.
Aggregators
Use Hash method Aggregators only when the number of distinct key column values is small. A Sort method Aggregator should be used when the number of distinct key values is large or unknown.
Database Stages
The following guidelines apply to database stages:
– Where possible, use the Connector stages or native parallel database stages for maximum performance and scalability.
– The ODBC Connector and ODBC Enterprise stages should only be used when a native parallel stage is not available for the given source or target database.
– When using Oracle, DB2, or Informix databases, use Orchestrate Schema Importer (orchdbutil) to properly import design metadata.
– Take care to observe the data type mappings.
Partitioning in Oracle. What? Why? When? Who? Where? How?
Partitioning enables tables, indexes, and index-organized tables to be subdivided into smaller, more manageable pieces; each of these pieces is called a "partition". From an application development perspective, there is no difference between a partitioned and a non-partitioned table: an application need not be modified to access a partitioned table if it was initially written against non-partitioned tables.
So now you know what partitioning in Oracle is; the only thing you still need is a little bit of syntax, and then you are a partitioning guru.
Oracle introduced partitioning with version 8.0, which supported only "Range Partitioning" (explained in detail below). Oracle 8i then added "Hash" and "Composite" partitioning, and 9i added "List Partitioning", along with many other features in each upgrade. Each partitioning method has its own advantages and disadvantages, and the decision of which one to use depends on the data and the type of application. Partitions can also be MODIFIED, RENAMED, MOVED, ADDED, DROPPED, TRUNCATED, and SPLIT, as sketched below. We will go through the details now.
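For illustration, a few of these maintenance operations are sketched below against the quarterly-partitioned SAMPLE_ORDERS table created in the range-partitioning example further down; the new partition bound, new names, and tablespace are assumptions made for the sketch, not part of the original post.

ALTER TABLE SAMPLE_ORDERS ADD PARTITION SO01Q1 VALUES LESS THAN (TO_DATE('01-APR-2001', 'DD-MON-YYYY'));   -- add a new quarter above the current highest bound
ALTER TABLE SAMPLE_ORDERS DROP PARTITION SO99Q1;                                                            -- drop the oldest quarter
ALTER TABLE SAMPLE_ORDERS TRUNCATE PARTITION SO99Q2;                                                        -- empty a partition without dropping it
ALTER TABLE SAMPLE_ORDERS RENAME PARTITION SO99Q3 TO SO_1999_Q3;
ALTER TABLE SAMPLE_ORDERS MOVE PARTITION SO00Q1 TABLESPACE TS02;                                            -- relocate a partition to another tablespace
ALTER TABLE SAMPLE_ORDERS SPLIT PARTITION SO99Q4 AT (TO_DATE('01-DEC-1999', 'DD-MON-YYYY'))
  INTO (PARTITION SO99Q4A, PARTITION SO99Q4B);                                                              -- split one quarter into two pieces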
Advantages of Using Partitions in a Table
1. Smaller and more manageable pieces of data (partitions)
2. Reduced recovery time
3. Reduced failure impact
4. Import/export can be done at the partition level
5. Faster access to data
6. Partitions work independently of one another
7. Very easy to use
Types of Partitioning Methods
1. RANGE Partitioning
This type of partitioning creates partitions based on a range of column values. Each partition is defined by a "partition bound" (non-inclusive) that limits the scope of the partition. The most commonly used column for a range partition is a date field. Let's say we have a table SAMPLE_ORDERS with a field ORDER_DATE, and that the table holds 5 years of history. We can then create partitions by date, say one for every quarter.
Every quarter's data then becomes a partition in the SAMPLE_ORDERS table. The first partition is the one with the lowest bound and the last is the partition with the highest bound. If a query wants to look at the data for the first quarter of 1999, then instead of scanning the complete table it goes directly to the partition for the first quarter of 1999.
This is an example of the syntax needed to create a RANGE partition.
CREATE TABLE SAMPLE_ORDERS
(ORDER_NUMBER NUMBER,
ORDER_DATE DATE,
CUST_NUM NUMBER,
TOTAL_PRICE NUMBER,
TOTAL_TAX NUMBER,
TOTAL_SHIPPING NUMBER)
PARTITION BY RANGE(ORDER_DATE)
(
PARTITION SO99Q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
PARTITION SO99Q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
PARTITION SO99Q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
PARTITION SO99Q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')),
PARTITION SO00Q1 VALUES LESS THAN (TO_DATE('01-APR-2000', 'DD-MON-YYYY')),
PARTITION SO00Q2 VALUES LESS THAN (TO_DATE('01-JUL-2000', 'DD-MON-YYYY')),
PARTITION SO00Q3 VALUES LESS THAN (TO_DATE('01-OCT-2000', 'DD-MON-YYYY')),
PARTITION SO00Q4 VALUES LESS THAN (TO_DATE('01-JAN-2001', 'DD-MON-YYYY'))
)
;
The above example creates 8 partitions on the SAMPLE_ORDERS table, each corresponding to one quarter. Partition SO99Q1 will contain the orders for only the first quarter of 1999.
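For example, a query that asks only for the first quarter of 1999 is pruned to the SO99Q1 partition; the same partition can also be addressed explicitly. Both statements below are illustrative:

SELECT COUNT(*), SUM(TOTAL_PRICE)
FROM SAMPLE_ORDERS
WHERE ORDER_DATE >= TO_DATE('01-JAN-1999', 'DD-MON-YYYY')
AND ORDER_DATE < TO_DATE('01-APR-1999', 'DD-MON-YYYY');

-- Equivalent query that names the partition directly:
SELECT COUNT(*), SUM(TOTAL_PRICE)
FROM SAMPLE_ORDERS PARTITION (SO99Q1);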
2. HASH Partitioning
Under this type of partitioning, the records in a table are partitioned based on a hash value computed from the column that is used for partitioning. Unlike range partitioning, "hash partitioning" does not attach any logical meaning to the partitions. Let's take an example.
CREATE TABLE SAMPLE_ORDERS
(ORDER_NUMBER NUMBER,
ORDER_DATE DATE,
CUST_NUM NUMBER,
TOTAL_PRICE NUMBER,
TOTAL_TAX NUMBER,
TOTAL_SHIPPING NUMBER,
ORDER_ZIP_CODE VARCHAR2(10))   -- data type assumed; the original omitted it
PARTITION BY HASH (ORDER_ZIP_CODE)
(PARTITION P1_ZIP TABLESPACE TS01,
PARTITION P2_ZIP TABLESPACE TS02,
PARTITION P3_ZIP TABLESPACE TS03,
PARTITION P4_ZIP TABLESPACE TS04)
ENABLE ROW MOVEMENT;
The above example creates four hash partitions based on the zip codes from where the orders were placed.
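As a quick, optional sanity check that the hash function spreads the rows reasonably evenly, each of the four partitions can be counted individually:

SELECT 'P1_ZIP' AS PARTITION_NAME, COUNT(*) AS ROW_COUNT FROM SAMPLE_ORDERS PARTITION (P1_ZIP)
UNION ALL
SELECT 'P2_ZIP', COUNT(*) FROM SAMPLE_ORDERS PARTITION (P2_ZIP)
UNION ALL
SELECT 'P3_ZIP', COUNT(*) FROM SAMPLE_ORDERS PARTITION (P3_ZIP)
UNION ALL
SELECT 'P4_ZIP', COUNT(*) FROM SAMPLE_ORDERS PARTITION (P4_ZIP);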
3. List Partitioning (Only with 9i)
Under this type of partitioning, the records in a table are partitioned based on a list of values. For a table that has, say, a community column as the defining key, partitions can be made for individual communities or groups of them: if the table holds communities such as ‘Government’, ‘Asian’, ‘Employees’, ‘American’, and ‘European’, then an ‘American’ list partition can be created to hold all the records whose community is ‘American’.
Let's take one example; in fact, we will modify the same example.
CREATE TABLE SAMPLE_ORDERS
(ORDER_NUMBER NUMBER,
ORDER_DATE DATE,
CUST_NUM NUMBER,
TOTAL_PRICE NUMBER,
TOTAL_TAX NUMBER,
TOTAL_SHIPPING NUMBER,
SHIP_TO_ZIP_CODE VARCHAR2(10),   -- data types assumed; the original omitted them
SHIP_TO_STATE VARCHAR2(2))
PARTITION BY LIST (SHIP_TO_STATE)
(PARTITION SHIP_TO_ARIZONA VALUES ('AZ') TABLESPACE TS01,
PARTITION SHIP_TO_CALIFORNIA VALUES ('CA') TABLESPACE TS02,
PARTITION SHIP_TO_ILLINOIS VALUES ('IL') TABLESPACE TS03,
PARTITION SHIP_TO_MASSACHUSETTS VALUES ('MA') TABLESPACE TS04,
PARTITION SHIP_TO_MICHIGAN VALUES ('MI') TABLESPACE TS05)
ENABLE ROW MOVEMENT;
The above example creates list partitions based on SHIP_TO_STATE, with each partition allocated to a different tablespace.
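If orders can arrive for states that are not in the list, a DEFAULT partition can be added to catch them; this is an optional extension of the example, and TS06 is an assumed additional tablespace:

ALTER TABLE SAMPLE_ORDERS ADD PARTITION SHIP_TO_OTHER VALUES (DEFAULT) TABLESPACE TS06;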
4. Composite Range-Hash Partitioning
This is a combination of range and hash partitioning: the data is first divided using range partitioning, and each range partition is then subdivided into hash subpartitions using a hash key. All subpartitions of a range partition together represent a logical subset of the data.
Lets modify the above example again:
CREATE TABLE SAMPLE_ORDERS
(ORDER_NUMBER NUMBER,
ORDER_DATE DATE,
CUST_NUM NUMBER,
CUST_NAME VARCHAR2(100),   -- data type corrected and length assumed; the original read VARCAHR2 with no length
TOTAL_PRICE NUMBER,
TOTAL_TAX NUMBER,
TOTAL_SHIPPING NUMBER,
SHIP_TO_ZIP_CODE VARCHAR2(10),   -- data types assumed; the original omitted them
SHIP_TO_STATE VARCHAR2(2))
TABLESPACE USERS
PARTITION BY RANGE (ORDER_DATE)
SUBPARTITION BY HASH (CUST_NAME)
-- Hash subpartitions take no VALUES clause; the list-style template in the original belongs to the range-list example (5).
SUBPARTITION TEMPLATE(
SUBPARTITION SP1 TABLESPACE TS01,
SUBPARTITION SP2 TABLESPACE TS02,
SUBPARTITION SP3 TABLESPACE TS03,
SUBPARTITION SP4 TABLESPACE TS04)
(
PARTITION SO99Q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
PARTITION SO99Q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
PARTITION SO99Q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
PARTITION SO99Q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')),
PARTITION SO00Q1 VALUES LESS THAN (TO_DATE('01-APR-2000', 'DD-MON-YYYY')),
PARTITION SO00Q2 VALUES LESS THAN (TO_DATE('01-JUL-2000', 'DD-MON-YYYY')),
PARTITION SO00Q3 VALUES LESS THAN (TO_DATE('01-OCT-2000', 'DD-MON-YYYY')),
PARTITION SO00Q4 VALUES LESS THAN (TO_DATE('01-JAN-2001', 'DD-MON-YYYY'))
)
ENABLE ROW MOVEMENT;
The above example shows that each range partition (SO99Q1, SO99Q2, and so on) is further subdivided into four hash subpartitions (SP1 to SP4) based on the hash of CUST_NAME.
5. Composite Range-List Partitioning ( Only with 9i)
This is a combination of range and list partitioning: the data is first divided using range partitioning, and each range partition is then subdivided into list subpartitions using list key values. Unlike a composite range-hash partition, each subpartition individually represents a logical subset of the data.
Index-organized tables can be partitioned using range or hash partitioning.
Let's modify the above example once more.
CREATE TABLE SAMPLE_ORDERS
(ORDER_NUMBER NUMBER,
ORDER_DATE DATE,
CUST_NUM NUMBER,
CUST_NAME VARCHAR2(100),   -- data type corrected and length assumed; the original read VARCAHR2 with no length
TOTAL_PRICE NUMBER,
TOTAL_TAX NUMBER,
TOTAL_SHIPPING NUMBER,
SHIP_TO_ZIP_CODE VARCHAR2(10),   -- data types assumed; the original omitted them
SHIP_TO_STATE VARCHAR2(2))
TABLESPACE USERS
PARTITION BY RANGE (ORDER_DATE)
SUBPARTITION BY LIST(SHIP_TO_STATE)
SUBPARTITION TEMPLATE(
SUBPARTITION SHIP_TO_ARIZONA VALUES ('AZ') TABLESPACE TS01,
SUBPARTITION SHIP_TO_CALIFORNIA VALUES ('CA') TABLESPACE TS02,
SUBPARTITION SHIP_TO_ILLINOIS VALUES ('IL') TABLESPACE TS03,
SUBPARTITION SHIP_TO_NORTHEAST VALUES ('MA', 'NY', 'NJ') TABLESPACE TS04,
SUBPARTITION SHIP_TO_MICHIGAN VALUES ('MI') TABLESPACE TS05)
-- List subpartitions require VALUES clauses; the unnamed SP1..SP5 template in the original belongs to the hash example above.
(
PARTITION SO99Q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
PARTITION SO99Q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
PARTITION SO99Q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
PARTITION SO99Q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')),
PARTITION SO00Q1 VALUES LESS THAN (TO_DATE('01-APR-2000', 'DD-MON-YYYY')),
PARTITION SO00Q2 VALUES LESS THAN (TO_DATE('01-JUL-2000', 'DD-MON-YYYY')),
PARTITION SO00Q3 VALUES LESS THAN (TO_DATE('01-OCT-2000', 'DD-MON-YYYY')),
PARTITION SO00Q4 VALUES LESS THAN (TO_DATE('01-JAN-2001', 'DD-MON-YYYY'))
)
ENABLE ROW MOVEMENT;
Erwin Data Modelling Data Warehousing Business Intelligence Oracle Database Dimensional Modeling Questions and Answers Part 4
Q) Erwin Tutorial
All Fusion Erwin Data Modeler, commonly known as Erwin, is a powerful and leading data modeling tool from Computer Associates. Computer Associates delivers several software products for enterprise management, storage management solutions, security solutions, application life cycle management, data management, and business intelligence.
Erwin makes database creation very simple by generating DDL (SQL) scripts from a data model using its Forward Engineering technique; it can also be used to create data models from an existing database using its Reverse Engineering technique.
Erwin workplace consists of the following main areas:
•Logical: In this view, data model represents business requirements like entities, attributes etc.
•Physical: In this view, data model represents physical structures like tables, columns, datatypes etc.
•Modelmart: Many users can work on the same data model concurrently.
Q)What can be done with Erwin?
1.Logical, Physical and dimensional data models can be created.
2.Data Models can be created from existing systems(rdbms, dbms, files etc.).
3.Different versions of a data model can be compared.
4.Data model and database can be compared.
5.SQL scripts can be generated to create databases from the data model.
6.Reports can be generated in different file formats like .html, .rtf, and .txt.
7.Data models can be opened and saved in several different file types like .er1, .ert, .bpx, .xml, .ers, .sql, .cmt, .df, .dbf, and .mdb files.
8.By using ModelMart, concurrent users can work on the same data model.
In order to create data models in Erwin, you need to have All Fusion Erwin Data Modeler installed on your system. If you have installed Modelmart, then more than one user can work on the same model.
Q)What is Data Modeling Development Cycle?
Gathering Business Requirements - First Phase: Data Modelers have to interact with business analysts to get the functional requirements and with end users to find out the reporting needs.
Conceptual Data Modeling(CDM) - Second Phase:
This data model includes all major entities, relationships and it will not contain much detail about attributes and is often used in the INITIAL PLANNING PHASE.
Logical Data Modeling(LDM) - Third Phase:
This is the actual implementation of a conceptual model in a logical data model. A logical data model is the version of the model that represents all of the business requirements of an organization.
Physical Data Modeling(PDM) - Fourth Phase:
This is a complete model that includes all required tables, columns, relationship, database properties for the physical implementation of the database.
Database - Fifth Phase:
DBAs instruct the data modeling tool to create SQL code from physical data model. Then the SQL code is executed in server to create databases.
Q)Standardization Needs | Modeling data:
Several data modelers may work on the different subject areas of a data model and all data modelers should use the same naming convention, writing definitions and business rules.
Nowadays, business-to-business (B2B) transactions are quite common, and standardization helps in understanding the business in a better way. Inconsistency across column names and definitions would create chaos across the business.
For example, when a data warehouse is designed, it may get data from several source systems and each source may have its own names, data types etc. These anomalies can be eliminated if a proper standardization is maintained across the organization.
Table Names Standardization:
Giving a table a full name gives an idea of what its data is about. Generally, do not abbreviate table names; however, this may differ according to the organization's standards. If a table name's length exceeds the database limits, then try to abbreviate it. Some general guidelines that may be used as a prefix or suffix for the table are listed below.
Examples:
Lookup – LKP - Used for Code, Type tables by which a fact table can be directly accessed.
e.g. Credit Card Type Lookup – CREDIT_CARD_TYPE_LKP
Fact – FCT - Used for transaction tables:
e.g. Credit Card Fact - CREDIT_CARD_FCT
Cross Reference - XREF – Tables that resolve many-to-many relationships.
e.g. Credit Card Member XREF – CREDIT_CARD_MEMBER_XREF
History – HIST - Tables that store history.
e.g. Credit Card Retired History – CREDIT_CARD_RETIRED_HIST
Statistics – STAT - Tables that store statistical information.
e.g. Credit Card Web Statistics – CREDIT_CARD_WEB_STAT
Column Names Standardization:
Some general guidelines are listed below that may be used as a prefix or suffix for the column.
Examples:
Key – KEY – System-generated surrogate key.
e.g. Credit Card Key – CRDT_CARD_KEY
Identifier – ID - Character column that is used as an identifier.
e.g. Credit Card Identifier – CRDT_CARD_ID
Code – CD - Numeric or alphanumeric column that is used as an identifying attribute.
e.g. State Code – ST_CD
Description – DESC - Description for a code, identifier or a key.
e.g. State Description – ST_DESC
Indicator – IND – to denote indicator columns.
e.g. Gender Indicator – GNDR_IND
Database Parameters Standardization:
Some general guidelines that may be used for other physical parameters are listed below; a small DDL sketch follows the examples.
Examples:
Index – IDX – for index names.
e.g. Credit Card Fact IDX01 – CRDT_CARD_FCT_IDX01
Primary Key – PK – for Primary key constraint names.
e.g. Credit Card Fact PK01 – CRDT_CARD_FCT_PK01
Alternate Keys – AK – for Alternate key names.
e.g. Credit Card Fact AK01 – CRDT_CARD_FCT_AK01
Foreign Keys – FK – for Foreign key constraint names.
e.g. Credit Card Fact FK01 – CRDT_CARD_FCT_FK01
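As a minimal sketch, the naming standards above might appear in DDL like this; the column list is abbreviated and hypothetical:

CREATE TABLE CRDT_CARD_FCT
(CRDT_CARD_KEY NUMBER NOT NULL,     -- system-generated surrogate key
CRDT_CARD_ID VARCHAR2(20),          -- character identifier
ST_CD VARCHAR2(2),                  -- state code
GNDR_IND CHAR(1),                   -- indicator column
CONSTRAINT CRDT_CARD_FCT_PK01 PRIMARY KEY (CRDT_CARD_KEY),
CONSTRAINT CRDT_CARD_FCT_AK01 UNIQUE (CRDT_CARD_ID));

CREATE INDEX CRDT_CARD_FCT_IDX01 ON CRDT_CARD_FCT (ST_CD);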
Q)Steps to create a Data Model
These are the general guidelines to create a standard data model and in real time, a data model may not be created in the same sequential manner as shown below. Based on the enterprise’s requirements, some of the steps may be excluded or included in addition to these.
Sometimes, data modeler may be asked to develop a data model based on the existing database. In that situation, the data modeler has to reverse engineer the database and create a data model.
1» Get Business requirements.
2» Create High Level Conceptual Data Model.
3» Create Logical Data Model.
4» Select target DBMS where data modeling tool creates the physical schema.
5» Create standard abbreviation document according to business standard.
6» Create domain.
7» Create Entity and add definitions.
8» Create attribute and add definitions.
9» Based on the analysis, try to create surrogate keys, super types and sub types.
10» Assign datatype to attribute. If a domain is already present then the attribute should be attached to the domain.
11» Create primary or unique keys to attribute.
12» Create check constraint or default to attribute.
13» Create unique index or bitmap index to attribute.
14» Create foreign key relationship between entities.
15» Create Physical Data Model.
16» Add database properties to the physical data model.
17» Create SQL scripts from the physical data model and forward them to the DBA.
18» Maintain the logical and physical data models.
19» For each release (version of the data model), try to compare the present version with the previous version of the data model. Similarly, try to compare the data model with the database to find out the differences.
20» Create a change log document for differences between the current version and previous version of the data model.
Q)Data Modeler Role
Business Requirement Analysis:
» Interact with Business Analysts to get the functional requirements.
» Interact with end users and find out the reporting needs.
» Conduct interviews, brain storming discussions with project team to get additional requirements.
» Gather accurate data by data analysis and functional analysis.
Development of data model:
» Create standard abbreviation document for logical, physical and dimensional data models.
» Create logical, physical and dimensional data models(data warehouse data modelling).
» Document logical, physical and dimensional data models (data warehouse data modelling).
Reports:
» Generate reports from data model.
Review:
» Review the data model with functional and technical team.
Creation of database:
» Create sql code from data model and co-ordinate with DBAs to create database.
» Check to see data models and databases are in synch.
Support & Maintenance:
» Assist developers, ETL, BI team and end users to understand the data model.
» Maintain change log for each data model.
Q)What is Conceptual Data Modeling?
Conceptual data model includes all major entities and relationships and does not contain much detailed level of information about attributes and is often used in the INITIAL PLANNING PHASE.
Conceptual data model is created by gathering business requirements from various sources like business documents, discussion with functional teams, business analysts, smart management experts and end users who do the reporting on the database. Data modelers create conceptual data model and forward that model to functional team for their review.
Conceptual Data Model - Highlights
•CDM is the first step in constructing a data model in top-down approach and is a clear and accurate visual representation of the business of an organization.
•CDM visualizes the overall structure of the database and provides high-level information about the subject areas or data structures of an organization.
•CDM discussion starts with main subject area of an organization and then all the major entities of each subject area are discussed in detail.
•CDM comprises entity types and relationships. The relationships between the subject areas and the relationships between the entities in a subject area are drawn using a symbolic notation (IDEF1X or IE). In a data model, cardinality represents the relationship between two entities: one-to-one, one-to-many, or many-to-many.
•CDM contains data structures that have not been implemented in the database.
•In the CDM discussion, the technical as well as non-technical teams project their ideas for building a sound logical data model.
Q)What is Enterprise Data Modeling?
The development of a common consistent view and understanding of data elements and their relationships across the enterprise is referred to as Enterprise Data Modeling. This type of data modeling provides access to information scattered throughout an enterprise under the control of different divisions or departments with different databases and data models.
Enterprise Data Modeling is sometimes called the global business model, and the entire information about the enterprise is captured in the form of entities.
Data Model Highlights
When an enterprise logical data model is transformed to a physical data model, super types and sub types may not carry over as-is, i.e. the logical and physical structures of super types and sub types may be entirely different. A data modeler has to change them according to the physical and reporting requirements.
When an enterprise logical data model is transformed to a physical data model, the length of table names, column names, etc. may exceed the maximum number of characters allowed by the database. So a data modeler has to manually edit them and change the physical names according to database or organizational standards.
One of the important things to note is the standardization of the data model. Since a same attribute may be present in several entities, the attribute names and data types should be standardized and a conformed dimension should be used to connect to the same attribute present in several tables.
Standard Abbreviation document is a must so that all data structure names would be consistent across the data model.
Q) Logical V/s Physical Data Model ?
When a data modeler works with a client, his title may be logical data modeler, physical data modeler, or a combination of both. A logical data modeler designs the data model to suit business requirements, creates and maintains the lookup data, compares versions of the data model, maintains the change log, and generates reports from the data model, whereas a physical data modeler has to know about the source and target database properties.
A physical data modeler should have the technical know-how to create data models from existing databases and to tune the data models with referential integrity, alternate keys, and indexes, and should know how to match indexes to SQL code. It would be good if the physical data modeler also knows about replication, clustering, and so on.
The differences between a logical data model and physical data model is shown below.
Logical vs Physical Data Modeling
LDM :Represents business information and defines business rules
PDM: Represents the physical implementation of the model in a database.
LDM :Entity
PDM :Table
LDM:Attribute
PDM:Column
LDM:Primary Key
PDM:Primary Key Constraint
LDM:Alternate Key
PDM:Unique Constraint or Unique Index
LDM:Inversion Key Entry
PDM:Non Unique Index
LDM:Rule
PDM:Check Constraint, Default Value
LDM:Relationship
PDM:Foreign Key
LDM:Definition
PDM:Comment
Q)Relational vs Dimensional
Relational data modeling is used in OLTP systems, which are transaction oriented, and dimensional data modeling is used in OLAP systems, which are analytical. In a data warehouse environment, the staging area is designed on OLTP concepts, since data has to be normalized, cleansed, and profiled before being loaded into a data warehouse or data mart. In an OLTP environment, lookups are stored as independent tables in detail, whereas these independent tables are merged into a single dimension in an OLAP environment such as a data warehouse.
Relational vs Dimensional
RDM:Data is stored in RDBMS
DDM:Data is stored in RDBMS or Multidimensional databases
RDM:Tables are units of storage
DDM:Cubes are units of storage
RDM:Data is normalized and used for OLTP.
Optimized for OLTP processing
DDM:Data is denormalized and used in datawarehouse and data mart. Optimized for OLAP
RDM:Several tables and chains of relationships among them
DDM:Few tables and fact tables are connected to dimensional tables
RDM:Volatile(several updates) and time variant
DDM:Non volatile and time invariant
RDM:Detailed level of transactional data
DDM:Summary of bulky transactional data
(Aggregates and Measures) used in business decisions
Q)Data Warehouse & Data Mart
A data warehouse is a relational/multidimensional database that is designed for query and analysis rather than transaction processing. A data warehouse usually contains historical data that is derived from transaction data. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.
In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
There are three types of data warehouses:
1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad enterprise-wide scope, but unlike the real enterprise data warehouse, data is refreshed in near real time and used for routine business activity.
3. Data Mart - Datamart is a subset of data warehouse and it supports a particular region, business unit or business function.
Data warehouses and data marts are built on dimensional data modeling where fact tables are connected with dimension tables. This is most useful for users to access data since a database can be visualized as a cube of several dimensions. A data warehouse provides an opportunity for slicing and dicing that cube along each of its dimensions.
Data Mart: A data mart is a subset of data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, data can be derived from an enterprise-wide data warehouse. In an independent data mart, data can be collected directly from sources.
Q)Star Schema in detail
In general, an organization is started to earn money by selling a product or by providing services around a product. An organization may be in one place or may have several branches.
When we consider the example of an organization selling products throughout the world, the four major dimensions are product, location, time, and organization. Dimension tables have been explained in detail under the section Dimensions. With this example, we will try to provide a detailed explanation of the STAR SCHEMA.
Q)What is Star Schema?
Star Schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more dimension and fact tables. It is called a star schema because the entity-relationship diagram between the dimensions and fact tables resembles a star, where one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table, which points towards the dimension tables. The advantages of a star schema are slicing down, performance increase, and easy understanding of data. A minimal SQL sketch is given after the design steps below.
Steps in designing Star Schema
•Identify a business process for analysis (like sales).
•Identify measures or facts (sales dollar).
•Identify dimensions for the facts (product dimension, location dimension, time dimension, organization dimension).
•List the columns that describe each dimension (for example, region name, branch name).
•Determine the lowest level of summary in the fact table (sales dollar).
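A minimal SQL sketch of such a star schema, with hypothetical table and column names, is shown below; the fact table holds the sales dollar measure plus foreign keys to each dimension:

CREATE TABLE PRODUCT_DIM (PRODUCT_KEY NUMBER PRIMARY KEY, PRODUCT_NAME VARCHAR2(100), CATEGORY_NAME VARCHAR2(100));
CREATE TABLE LOCATION_DIM (LOCATION_KEY NUMBER PRIMARY KEY, REGION_NAME VARCHAR2(100), BRANCH_NAME VARCHAR2(100));
CREATE TABLE TIME_DIM (TIME_KEY NUMBER PRIMARY KEY, CALENDAR_DATE DATE, MONTH_NAME VARCHAR2(20), YEAR_NUM NUMBER);
CREATE TABLE ORG_DIM (ORG_KEY NUMBER PRIMARY KEY, ORG_NAME VARCHAR2(100));

CREATE TABLE SALES_FCT
(PRODUCT_KEY NUMBER NOT NULL REFERENCES PRODUCT_DIM (PRODUCT_KEY),
LOCATION_KEY NUMBER NOT NULL REFERENCES LOCATION_DIM (LOCATION_KEY),
TIME_KEY NUMBER NOT NULL REFERENCES TIME_DIM (TIME_KEY),
ORG_KEY NUMBER NOT NULL REFERENCES ORG_DIM (ORG_KEY),
SALES_DOLLAR NUMBER,   -- the measure, held at the lowest level of summary
CONSTRAINT SALES_FCT_PK PRIMARY KEY (PRODUCT_KEY, LOCATION_KEY, TIME_KEY, ORG_KEY));

Note that the category hierarchy is kept inside PRODUCT_DIM itself, as a star schema prescribes.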
Important aspects of Star Schema & Snow Flake Schema
•In a star schema every dimension will have a primary key.
•In a star schema, a dimension table will not have any parent table.
•Whereas in a snow flake schema, a dimension table will have one or more parent tables.
•Hierarchies for the dimensions are stored in the dimensional table itself in star schema.
•Whereas hierarchies are broken into separate tables in a snow flake schema. These hierarchies help to drill down the data from the topmost level to the lowermost level.
Glossary:
Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, from the Quarter level to the Year level. A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not.
Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.
Fact Table
A table in a star schema that contains facts and is connected to the dimensions. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). A fact table usually contains facts with the same level of aggregation.
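For instance, slicing the hypothetical SALES_FCT sketch above by year and product is a simple join-and-aggregate query:

SELECT t.YEAR_NUM,
       p.PRODUCT_NAME,
       SUM(f.SALES_DOLLAR) AS TOTAL_SALES   -- aggregated measure
FROM SALES_FCT f
JOIN TIME_DIM t ON t.TIME_KEY = f.TIME_KEY
JOIN PRODUCT_DIM p ON p.PRODUCT_KEY = f.PRODUCT_KEY
GROUP BY t.YEAR_NUM, p.PRODUCT_NAME;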
Q)Snowflake Schema in detail
A snowflake schema is a term that describes a star schema structure normalized through the use of outrigger tables, i.e. dimension table hierarchies are broken into simpler tables. In the star schema example we had 4 dimensions (location, product, time, organization) and a fact table (sales).
In the snowflake schema, the corresponding example has 4 dimension tables, 4 lookup tables, and 1 fact table, because the hierarchies (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and held separately. In OLAP, this snowflake approach increases the number of joins and gives poorer performance when retrieving data. A few organizations normalize the dimension tables to save space, but since dimension tables hold relatively little data, the snowflake schema approach can usually be avoided.
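As a small, hypothetical illustration in SQL, snowflaking the product dimension from the earlier star schema sketch means moving the category hierarchy out into its own lookup table:

CREATE TABLE CATEGORY_LKP
(CATEGORY_KEY NUMBER PRIMARY KEY,
CATEGORY_NAME VARCHAR2(100));

-- Snowflaked version of PRODUCT_DIM: the category attribute becomes a foreign key to the lookup table.
CREATE TABLE PRODUCT_DIM
(PRODUCT_KEY NUMBER PRIMARY KEY,
PRODUCT_NAME VARCHAR2(100),
CATEGORY_KEY NUMBER REFERENCES CATEGORY_LKP (CATEGORY_KEY));

Resolving this extra join at query time is what costs performance in OLAP, as noted above.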
Q)ETL Tools what to learn?
With the help of ETL tools, we can create powerful target Data Warehouses without much difficulty. Following are the various options that we have to know and learn in order to use ETL tools.
Software:
» How to install ETL tool on server/client?
Working with an ETL Tool:
» How to work with various options like designer, mapping, workflow, scheduling, etc.?
» How to work with sources like DBMS, relational source databases, files, ERPs, etc., and import the source definitions?
» How to import data from data modeling tools, applications, etc.?
» How to work with targets like DBMS, relational target databases, files, ERPs, etc., and import the target definitions?
» How to create target definitions?
» How to create mappings between source definitions and target definitions?
» How to create transformations?
» How to cleanse the source data?
» How to create a dimension, slowly changing dimensions, a cube, etc.?
» How to create and monitor workflows?
» How to configure, monitor and run debugger?
» How to view and generate metadata reports?
All Fusion Erwin Data Modeler commonly known as Erwin, is a powerful and leading data modeling tool from Computer Associates. Computer Associates delivers several softwares for enterprise management, storage management solutions, security solutions, application life cycle management, data management and business intelligence.
Erwin makes database creation very simple by generating the DDL(sql) scripts from a data model by using its Forward Engineering technique or Erwin can be used to create data models from the existing database by using its Reverse Engineering technique.
Erwin workplace consists of the following main areas:
•Logical: In this view, data model represents business requirements like entities, attributes etc.
•Physical: In this view, data model represents physical structures like tables, columns, datatypes etc.
•Modelmart : Many users can work with a same data model concurrently.
Q)What can be done with Erwin?
1.Logical, Physical and dimensional data models can be created.
2.Data Models can be created from existing systems(rdbms, dbms, files etc.).
3.Different versions of a data model can be compared.
4.Data model and database can be compared.
5.SQl scripts can be generated to create databases from data model.
6.Reports can be generated in different file formats like .html, .rtf, and .txt.
7.Data models can be opened and saved in several different file types like .er1, .ert, .bpx, .xml, .ers, .sql, .cmt, .df, .dbf, and .mdb files.
8.By using ModelMart, concurrent users can work on the same data model.
In order to create data models in Erwin, you need to have this All Fusion Erwin Data Modeler installed in your system. If you have installed Modelmart, then more than one user can work on the same model.
Q)What is Data Modeling Development Cycle?
Gathering Business Requirements - First Phase: Data Modelers have to interact with business analysts to get the functional requirements and with end users to find out the reporting needs.
Conceptual Data Modeling(CDM) - Second Phase:
This data model includes all major entities, relationships and it will not contain much detail about attributes and is often used in the INITIAL PLANNING PHASE.
Logical Data Modeling(LDM) - Third Phase:
This is the actual implementation of a conceptual model in a logical data model. A logical data model is the version of the model that represents all of the business requirements of an organization.
Physical Data Modeling(PDM) - Fourth Phase:
This is a complete model that includes all required tables, columns, relationship, database properties for the physical implementation of the database.
Database - Fifth Phase:
DBAs instruct the data modeling tool to create SQL code from physical data model. Then the SQL code is executed in server to create databases.
Q)Standardization Needs | Modeling data:
Several data modelers may work on different subject areas of a data model, and all of them should use the same conventions for naming, writing definitions and business rules.
Nowadays, business-to-business (B2B) transactions are quite common, and standardization helps in understanding the business better. Inconsistent column names and definitions would create chaos across the business.
For example, when a data warehouse is designed, it may get data from several source systems, and each source may have its own names, data types etc. These anomalies can be eliminated if proper standardization is maintained across the organization.
Table Names Standardization:
Giving tables full names gives an idea of what data they contain. Generally, do not abbreviate table names; however, this may differ according to the organization's standards. If a table name's length exceeds the database limit, then abbreviate it. Some general guidelines are listed below that may be used as a prefix or suffix for the table name.
Examples:
Lookup – LKP – Used for code/type tables by which a fact table can be directly accessed.
e.g. Credit Card Type Lookup – CREDIT_CARD_TYPE_LKP
Fact – FCT – Used for transaction tables.
e.g. Credit Card Fact – CREDIT_CARD_FCT
Cross Reference – XREF – Tables that resolve many-to-many relationships.
e.g. Credit Card Member XREF – CREDIT_CARD_MEMBER_XREF
History – HIST – Tables that store history.
e.g. Credit Card Retired History – CREDIT_CARD_RETIRED_HIST
Statistics – STAT – Tables that store statistical information.
e.g. Credit Card Web Statistics – CREDIT_CARD_WEB_STAT
Column Names Standardization:
Some general guidelines are listed below that may be used as a prefix or suffix for the column.
Examples:
Key – KEY – System-generated surrogate key.
e.g. Credit Card Key – CRDT_CARD_KEY
Identifier – ID – Character column that is used as an identifier.
e.g. Credit Card Identifier – CRDT_CARD_ID
Code – CD – Numeric or alphanumeric column that is used as an identifying attribute.
e.g. State Code – ST_CD
Description – DESC – Description for a code, identifier or a key.
e.g. State Description – ST_DESC
Indicator – IND – Used to denote indicator (flag) columns.
e.g. Gender Indicator – GNDR_IND
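Putting the table and column conventions together, a credit card table might be declared as sketched below (a hypothetical illustration; the exact column list is not from any specific system):

    CREATE TABLE CRDT_CARD (
        CRDT_CARD_KEY        INTEGER NOT NULL,  -- KEY:  system-generated surrogate key
        CRDT_CARD_ID         VARCHAR(20),       -- ID:   character identifier
        CRDT_CARD_TYPE_CD    VARCHAR(10),       -- CD:   identifying code
        CRDT_CARD_TYPE_DESC  VARCHAR(100),      -- DESC: description for the code
        ACTIVE_IND           CHAR(1)            -- IND:  indicator (flag) column
    );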
Database Parameters Standardization:
Some general guidelines are listed below that may be used for other physical parameters.
Examples:
Index – IDX – for index names.
e.g. Credit Card Fact IDX01 – CRDT_CARD_FCT_IDX01
Primary Key – PK – for primary key constraint names.
e.g. Credit Card Fact PK01 – CRDT_CARD_FCT_PK01
Alternate Key – AK – for alternate key names.
e.g. Credit Card Fact AK01 – CRDT_CARD_FCT_AK01
Foreign Key – FK – for foreign key constraint names.
e.g. Credit Card Fact FK01 – CRDT_CARD_FCT_FK01
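As a sketch of how these physical-parameter conventions look in DDL (the column list, and the assumption that CREDIT_CARD_TYPE_LKP is keyed on CRDT_CARD_TYPE_CD, are illustrative only), the fact table's constraints and index might be named as follows:

    ALTER TABLE CRDT_CARD_FCT ADD CONSTRAINT CRDT_CARD_FCT_PK01 PRIMARY KEY (CRDT_CARD_KEY);  -- PK
    ALTER TABLE CRDT_CARD_FCT ADD CONSTRAINT CRDT_CARD_FCT_AK01 UNIQUE (CRDT_CARD_ID);        -- AK
    ALTER TABLE CRDT_CARD_FCT ADD CONSTRAINT CRDT_CARD_FCT_FK01
        FOREIGN KEY (CRDT_CARD_TYPE_CD) REFERENCES CREDIT_CARD_TYPE_LKP (CRDT_CARD_TYPE_CD);  -- FK
    CREATE INDEX CRDT_CARD_FCT_IDX01 ON CRDT_CARD_FCT (CRDT_CARD_TYPE_CD);                    -- IDX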
Q)Steps to create a Data Model
These are general guidelines for creating a standard data model; in practice, a data model may not be created in exactly the sequential manner shown below. Based on the enterprise's requirements, some of the steps may be excluded, or additional steps included.
Sometimes, a data modeler may be asked to develop a data model based on an existing database. In that situation, the data modeler has to reverse engineer the database to create the data model.
1» Get Business requirements.
2» Create High Level Conceptual Data Model.
3» Create Logical Data Model.
4» Select target DBMS where data modeling tool creates the physical schema.
5» Create standard abbreviation document according to business standard.
6» Create domain.
7» Create Entity and add definitions.
8» Create attribute and add definitions.
9» Based on the analysis, try to create surrogate keys, super types and sub types.
10» Assign datatype to attribute. If a domain is already present then the attribute should be attached to the domain.
11» Create primary or unique keys to attribute.
12» Create check constraint or default to attribute.
13» Create unique index or bitmap index to attribute.
14» Create foreign key relationship between entities.
15» Create Physical Data Model.
16» Add database properties to physical data model.
17» Create SQL Scripts from Physical Data Model and forward that to DBA.
18» Maintain Logical & Physical Data Model.
19» For each release (version of the data model), try to compare the present version with the previous version of the data model. Similarly, try to compare the data model with the database to find out the differences.
20» Create a change log document for differences between the current version and previous version of the data model.
Q)Data Modeler Role
Business Requirement Analysis:
» Interact with Business Analysts to get the functional requirements.
» Interact with end users and find out the reporting needs.
» Conduct interviews and brainstorming discussions with the project team to get additional requirements.
» Gather accurate data by data analysis and functional analysis.
Development of data model:
» Create standard abbreviation document for logical, physical and dimensional data models.
» Create logical, physical and dimensional data models (data warehouse data modeling).
» Document logical, physical and dimensional data models (data warehouse data modeling).
Reports:
» Generate reports from data model.
Review:
» Review the data model with functional and technical team.
Creation of database:
» Create SQL code from the data model and coordinate with DBAs to create the database.
» Check that data models and databases are in sync.
Support & Maintenance:
» Assist developers, ETL, BI team and end users to understand the data model.
» Maintain change log for each data model.
Q)What is Conceptual Data Modeling?
Conceptual data model includes all major entities and relationships and does not contain much detailed level of information about attributes and is often used in the INITIAL PLANNING PHASE.
The conceptual data model is created by gathering business requirements from various sources like business documents, discussions with functional teams, business analysts, subject matter experts and end users who do the reporting on the database. Data modelers create the conceptual data model and forward it to the functional team for review.
Conceptual Data Model - Highlights
•CDM is the first step in constructing a data model in the top-down approach and is a clear and accurate visual representation of the business of an organization.
•CDM visualizes the overall structure of the database and provides high-level information about the subject areas or data structures of an organization.
•CDM discussion starts with main subject area of an organization and then all the major entities of each subject area are discussed in detail.
•CDM comprises entity types and relationships. The relationships between the subject areas and the relationships between the entities in a subject area are drawn by symbolic notation (IDEF1X or IE). In a data model, cardinality represents the relationship between two entities, i.e., a one-to-one, one-to-many, or many-to-many relationship between the entities.
•CDM contains data structures that have not yet been implemented in the database.
•In the CDM discussion, technical as well as non-technical teams project their ideas for building a sound logical data model.
Q)What is Enterprise Data Modeling?
The development of a common consistent view and understanding of data elements and their relationships across the enterprise is referred to as Enterprise Data Modeling. This type of data modeling provides access to information scattered throughout an enterprise under the control of different divisions or departments with different databases and data models.
Enterprise Data Modeling is sometimes called the global business model; the entire information about the enterprise is captured in the form of entities.
Data Model Highlights
When an enterprise logical data model is transformed into a physical data model, super types and sub types may not be carried over as is; i.e., the logical and physical structures of super types and sub types may be entirely different. A data modeler has to change them according to the physical and reporting requirements.
When an enterprise logical data model is transformed into a physical data model, the length of table names, column names, etc. may exceed the maximum number of characters allowed by the database. A data modeler then has to manually edit the physical names according to the database or organization's standards.
One of the important things to note is the standardization of the data model. Since the same attribute may be present in several entities, the attribute names and data types should be standardized, and a conformed dimension should be used to connect to the same attribute present in several tables.
A standard abbreviation document is a must so that all data structure names are consistent across the data model.
Q)Logical vs Physical Data Model?
When a data modeler works with a client, the title may be logical data modeler, physical data modeler, or a combination of both. A logical data modeler designs the data model to suit business requirements, creates and maintains the lookup data, compares versions of the data model, maintains the change log, and generates reports from the data model, whereas a physical data modeler has to know about the properties of the source and target databases.
A physical data modeler should have the technical know-how to create data models from existing databases and to tune the data models with referential integrity, alternate keys and indexes, and should know how to match indexes to SQL code. It also helps if the physical data modeler knows about replication, clustering and so on.
The differences between a logical data model and a physical data model are shown below.
Logical vs Physical Data Modeling
LDM: Represents business information and defines business rules
PDM: Represents the physical implementation of the model in a database
LDM: Entity
PDM: Table
LDM: Attribute
PDM: Column
LDM: Primary Key
PDM: Primary Key Constraint
LDM: Alternate Key
PDM: Unique Constraint or Unique Index
LDM: Inversion Key Entry
PDM: Non-Unique Index
LDM: Rule
PDM: Check Constraint, Default Value
LDM: Relationship
PDM: Foreign Key
LDM: Definition
PDM: Comment
Q)Relational vs Dimensional
Relational Data Modeling is used in OLTP systems, which are transaction oriented, and Dimensional Data Modeling is used in OLAP systems, which are analysis oriented. In a data warehouse environment, the staging area is designed on OLTP concepts, since data has to be normalized, cleansed and profiled before being loaded into a data warehouse or data mart. In an OLTP environment, lookups are stored as independent, detailed tables, whereas these independent tables are merged into a single dimension in an OLAP environment like a data warehouse.
Relational vs Dimensional
RDM: Data is stored in an RDBMS
DDM: Data is stored in an RDBMS or in multidimensional databases
RDM: Tables are units of storage
DDM: Cubes are units of storage
RDM: Data is normalized and used for OLTP; optimized for OLTP processing
DDM: Data is denormalized and used in data warehouses and data marts; optimized for OLAP
RDM: Several tables and chains of relationships among them
DDM: Few tables; fact tables are connected to dimension tables
RDM: Volatile (several updates) and time variant
DDM: Non-volatile and time invariant
RDM: Detailed level of transactional data
DDM: Summary of bulky transactional data (aggregates and measures) used in business decisions
Q)Data Warehouse & Data Mart
A data warehouse is a relational/multidimensional database that is designed for query and analysis rather than transaction processing. A data warehouse usually contains historical data that is derived from transaction data. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.
In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
There are three types of data warehouses:
1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad enterprise-wide scope, but unlike the real enterprise data warehouse, data is refreshed in near real time and used for routine business activity.
3. Data Mart - Datamart is a subset of data warehouse and it supports a particular region, business unit or business function.
Data warehouses and data marts are built on dimensional data modeling where fact tables are connected with dimension tables. This is most useful for users to access data since a database can be visualized as a cube of several dimensions. A data warehouse provides an opportunity for slicing and dicing that cube along each of its dimensions.
Data Mart: A data mart is a subset of data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, data can be derived from an enterprise-wide data warehouse. In an independent data mart, data can be collected directly from sources.
Q)Star Schema in detail
In general, an organization is started to earn money by selling a product or by providing services for a product. An organization may be located at one place or may have several branches.
When we consider the example of an organization selling products throughout the world, the four major dimensions are product, location, time and organization. Dimension tables have been explained in detail under the section Dimensions. With this example, we will try to provide a detailed explanation of the STAR SCHEMA.
Q)What is Star Schema?
Star Schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more dimension and fact tables. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star, where one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table, which points towards the dimension tables. The advantages of a star schema are easy slicing of data, performance gains and ease of understanding.
Steps in designing Star Schema
•Identify a business process for analysis (like sales).
•Identify measures or facts (sales dollar).
•Identify dimensions for the facts (product dimension, location dimension, time dimension, organization dimension).
•List the columns that describe each dimension (e.g. region name, branch name).
•Determine the lowest level of summary in a fact table (sales dollar).
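To make these steps concrete, the sales example above could be forward engineered into DDL roughly like the sketch below (a hedged illustration; only two of the four dimensions are shown, and all table and column names are hypothetical):

    -- Dimension tables: hierarchy levels are kept inside the dimension itself (star schema)
    CREATE TABLE PRODUCT_DIM (
        PRODUCT_KEY    INTEGER NOT NULL PRIMARY KEY,   -- surrogate key
        PRODUCT_NAME   VARCHAR(100),
        CATEGORY_NAME  VARCHAR(100)                    -- hierarchy level stored in the dimension
    );
    CREATE TABLE TIME_DIM (
        TIME_KEY       INTEGER NOT NULL PRIMARY KEY,
        CALENDAR_DATE  DATE,
        MONTH_NAME     VARCHAR(20),
        QUARTER_NAME   VARCHAR(20),
        YEAR_NUM       INTEGER
    );
    -- Fact table: foreign keys to the dimensions plus the measure (sales dollar);
    -- the primary key is the composite of the dimension keys
    CREATE TABLE SALES_FCT (
        PRODUCT_KEY    INTEGER NOT NULL REFERENCES PRODUCT_DIM (PRODUCT_KEY),
        TIME_KEY       INTEGER NOT NULL REFERENCES TIME_DIM (TIME_KEY),
        SALES_DOLLAR   DECIMAL(15,2),
        PRIMARY KEY (PRODUCT_KEY, TIME_KEY)
    );

The location and organization dimensions would follow the same pattern, each adding one more foreign key column to the fact table.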
Important aspects of Star Schema & Snowflake Schema
•In a star schema, every dimension will have a primary key.
•In a star schema, a dimension table will not have any parent table.
•Whereas in a snowflake schema, a dimension table will have one or more parent tables.
•Hierarchies for the dimensions are stored in the dimensional table itself in a star schema.
•Whereas in a snowflake schema, hierarchies are broken into separate tables. These hierarchy tables help to drill down the data from the topmost level to the lowermost level.
Glossary:
Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, from the Quarter level to the Year level. A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not.
Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.
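For example, assuming the hypothetical SALES_FCT and TIME_DIM tables sketched earlier, aggregating along the time hierarchy from the Month level up to the Quarter level is simply a join and a GROUP BY:

    SELECT t.YEAR_NUM,
           t.QUARTER_NAME,
           SUM(f.SALES_DOLLAR) AS QUARTERLY_SALES   -- month-level facts rolled up to the Quarter level
    FROM   SALES_FCT f
    JOIN   TIME_DIM  t ON t.TIME_KEY = f.TIME_KEY
    GROUP BY t.YEAR_NUM, t.QUARTER_NAME;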
Fact Table
A table in a star schema that contains facts and is connected to the dimensions. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). A fact table usually contains facts with the same level of aggregation.
Q)Snowflake Schema in detail
A snowflake schema is a term that describes a star schema structure normalized through the use of outrigger tables, i.e. dimension table hierarchies are broken into simpler tables. In the star schema example we had 4 dimensions (location, product, time, organization) and a fact table (sales).
In a snowflake schema, that example would have 4 dimension tables, 4 lookup tables and 1 fact table, because the hierarchies (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and held separately. In OLAP, this snowflake approach increases the number of joins and degrades performance when retrieving data. A few organizations normalize the dimension tables this way to save space, but since dimension tables usually hold comparatively little data, the snowflake schema approach is often avoided.
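As a hedged sketch of that normalization (names are hypothetical and follow on from the star schema sketch above), the category level of the product hierarchy could be broken out into its own lookup table:

    -- Lookup/outrigger table holding the CATEGORY level of the product hierarchy
    CREATE TABLE CATEGORY_LKP (
        CATEGORY_KEY   INTEGER NOT NULL PRIMARY KEY,
        CATEGORY_NAME  VARCHAR(100)
    );
    -- The snowflaked product dimension references the lookup instead of storing the level itself
    CREATE TABLE PRODUCT_DIM (
        PRODUCT_KEY    INTEGER NOT NULL PRIMARY KEY,
        PRODUCT_NAME   VARCHAR(100),
        CATEGORY_KEY   INTEGER NOT NULL REFERENCES CATEGORY_LKP (CATEGORY_KEY)
    );

A query that slices sales by category now needs an extra join (SALES_FCT to PRODUCT_DIM to CATEGORY_LKP), which is exactly why the snowflake approach tends to retrieve data more slowly than the pure star schema.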
Q)ETL Tools what to learn?
With the help of ETL tools, we can create powerful target Data Warehouses without much difficulty. Following are the various options that we have to know and learn in order to use ETL tools.
Software:
» How to install ETL tool on server/client?
Working with an ETL Tool:
» How to work with various options like designer, mapping, workflow, scheduling etc.,?
» How to work with sources like DBMS, relational source databases, files, ERPs etc., and import the source definitions?
» How to import data from data modeling tools, applications etc.,?
» How to work with targets like DBMS, relational target databases, files, ERPs etc., and import the target definitions?
» How to create target definitions?
» How to create mappings between source definitions and target definitions?
» How to create transformations?
» How to cleanse the source data?
» How to create a dimension, slowly changing dimensions, cube etc.,?
» How to create and monitor workflows?
» How to configure, monitor and run debugger?
» How to view and generate metadata reports?