Monday, 22 July 2013

Testing - Data Warehouse Concepts

     Data Warehousing
Concepts
  

Data Warehousing

Is an approach/process of creating central or multiple data stores with integrated, cleansed, aggregated and historical data to support strategic decision making in an enterprise.

Data Warehouse

Data warehouse is a physical database called centralized data store containing integrated aggregated/summarized and historical business data. It is a physical database having dimensions and facts which will be related in the form of schemas like Star and Snowflake
Characteristics of DWH
DWH is   Subject oriented, time variant, non-volatile integrated set of data
·         Subject-oriented: All relevant data about a subject is gathered and stored as a single set in a consistent format
Ex: sales data ware house is built for sales related data analysis
·         ­Time-variant: Long-term storage of data, it store historical data so the data keeps on increasing over the period of time.
ex:5-10 years of data
·         Non-volatile: data in DW is read-only. No deletion operation is performed. The data is static, one version of the truth regardless of when the question is asked.
·         Integrated : Data from various homogeneous or heterogeneous will be stored with consistent naming conventions, measurements, encoding structures, and physical attributes

DWH Architecture

DWH architecture is roadmap for building data ware house. There is no fixed architecture





DWH Architecture with Staging Area and Data marts


Components of DWH Architecture:
·         Data Sources (OLTP’s)
·         Data Integration (ETL)
·         Staging Area
·         Operational Data Store (ODS)
·         Data Warehouse/Data Mart
·         Metadata
·         Business Intelligence

Databases can be classified as...
·         Online Transactional Processing (OLTP) :

o    Database stores day to day transactions
o    Contain detail data
o    Insertions and deletions happens frequently
o    Highly normalized by design
o    Cannot support for strategic decision
Examples: online trading or application filing systems, ATM transactions 

·         Online Analytical Processing (OLAP)
o    Stores the data multi-dimensionally
o    Highly optimized for quicker data retrieval
o    Referred to as Decision Support System
o    Contains summarized data at various grains to support business analytics and statistical analysis
Types Of OLAP’: Following are the important types of OLAP’s
MOLAP : Multidimensional OLAP, with a popular acronym of MOLAP, is widely regarded as the classic form of OLAP. One of the major distinctions of MOLAP against a ROLAP tool is that data are pre-summarized and are stored in an optimized format in a multidimensional cube, instead of in a relational database. In this type of model, data are structured into proprietary formats in accordance with a client’s reporting requirements with the calculations pre-generated on the cubes.
ROLAP: Considered the fastest growing OLAP technology style, ROLAP or “Relational” OLAP systems work primarily from the data that resides in a relational database, where the base data and dimension tables are stored as relational tables.
This model permits multidimensional analysis of data as this enables users to perform a function equivalent to that of the traditional OLAP slicing and dicing feature. This is achieved thorough use of any SQL reporting tool to extract or ‘query’ data directly from the data warehouse. Wherein specifying a ‘Where clause’ equals performing a certain slice and dice action.
One advantage of ROLAP over the other styles of OLAP analytic tools is that it is deemed to be more scalable in handling huge amounts of data. ROLAP sits on top of relational databases therefore enabling it to leverage several functionalities that a relational database is capable of.
HOLAP: HOLAP is the product of the attempt to incorporate the best features of MOLAP and ROLAP into a single architecture. This tool tried to bridge the technology gap of both products by enabling access or use to both multidimensional database (MDDB) and Relational Database Management System (RDBMS) data stores. HOLAP systems stores larger quantities of detailed data in the relational tables while the aggregations are stored in the pre-calculated cubes. HOLAP also has the capacity to “drill through” from the cube down to the relational tables for delineated data.
Some of the advantages of this system are better scalability, quick data processing and flexibility in accessing of data sources
DOLAP: Desktop OLAP, or “DOLAP” is based on the idea that a user can download a section of the data from the database or source, and work with that dataset locally, or on their desktop. DOLAP is easier to deploy and has a cheaper cost but comes with a very limited functionality in comparison with other OLAP applications





OLTP Vs DWH
OLTP
Data Warehouse
application oriented
subject oriented
multiple diverse sources
Integrated
Updateable
Nonvolatile
real-time, current
time-variant and contain single version of truth
Capture business transaction as it happens
Navigate through the data and identify what to do
Designed to reduce redundant data
Designed to increase analysis efficiency
Mission critical applications
Non-mission critical applications
Automation to operate the business
Query and Analysis drive the business
Data Management
Data Consolidation
Two dimensional
Multi-dimensional
Granular data
Granular as well as aggregated data
Normalized data
Usually de-normalized data
End users are operations, sales staff
End users are business analysts, executives, senior management
Predictable transaction Volume
Unpredictable ad-hoc query volume



What does Data Warehouse contain?
·         Dimension Table: A dimension is a structure contains attributes help to describe the dimensional value. They are normally descriptive, textual values. Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis
·         Fact Table: Fact table contains the measures or fact (e.g qty sold) and foreign key. A fact table contains either detail-level facts or facts(Measures or Metrics) that have been aggregated. Fact tables that contain aggregated facts are often called summary tables.A fact table usually contains facts with the same level of aggregation
Type Of Facts

o    Additive - Additive facts can be aggregated by simple arithmetical addition across various levels(columns).Example: Quantity, Revenue, invoice amount etc
o    Non Additive: Non additive facts cannot be aggregated across various levels. Example: Age of an employee or product.
o    Semi Additive: these facts can be aggregated some extent, but not across all levels. Examples Bank Account balances

·         Business Data: Data that represents a business is called Business data. like sales, Accounting, marketing etc
·         Meta data : Its data about data
Data Mart: A subset of a data warehouse that focuses on one or more specific subject areas. The data usually is extracted from the data warehouse and further demoralized and indexed to support intense usage by targeted customers.

DSS – Decision support system: It’s a database contains a data which support for making decisions is called Decision support system. Database(s), warehouse(s), and/or mart(s) in Conjunction with reporting and analysis software optimized to support timely business decision making, hence we called Data warehouse is a DSS.

ODS – Operational Data Store: it a physical data base and can store near real time data for analysis. It contains partial historical data for about 6 months to 9 months.

Meta Data – Literally "data about data” More usefully, descriptions of what kind of Information is stored where, how it is encoded, how it is related to other information, Where it comes from and how it is related to your business. A hot topic right now is Standardizing Meta data across products from different vendors.

Methodology - The steps followed to guarantee repeatability of success. A good methodology is built on top of real world experience. For example, see The Hughes- Volume Methodology.

Measures/Facts: Measure or a fact is a numerical value on which we will perform analysis
Types of Dimension
·         Regular Dimension
·         Confirmed Dimension: A conformed dimension is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions and concepts in each implementation. A conformed dimension cuts across many facts.
·          
·         Junk Dimension: (eg: Transactional Flags and codes and other attributes). A junk dimension is a convenient grouping of typically low-cardinality flags and indicators. A Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables. The nature of these attributes is usually text or various flags
·         Degenerate Dimension (derived from fact table): When the dimension attribute is stored as part of fact table, and not in a separate dimension table. Typically used for high cardinality dimensions like "transaction number"
·         Slowly changing Dimension: Data In dimension tables changes over the time. Like a address or phone number of a customer. When such changes happens, retaining history is a will decide how much history can use in data warehouse.
·         Slowly changing dimension can be used to design a dimension table with history. SCD’s are categorizing by 3 types


o    Type 1: Keeps current data in the dimension table

o    Type 2: Keeps Full History in Dimension table. For Every change in dimension column/Field, an additional row will be inserted with Identifier such as Flag, Version or Timestamp. Dimension in Type 2 will grow vertically.

o    Type 3: Keeps current and just previous history of a dimension record/row. Dimension with type 3 will grow horizontally

·         Role play Dimension: (e.g. Time of Sale, Time of delivery etc.). It’s a dimension used based on purpose. Like time dimension can play order_date dimension and also ship_date dimension. That means we can analyze the order by order date or ship date.
Types of Fact Tables
·         Fact table: Fact table contains all dimensional keys as Foreign Keys and Facts or Measures. It contains a primary Key called surrogate key. Which is a substitute to Primary key to retain history
·         Fact less Fact table: These table or Facts but will not have Measure/fact in it.
Data Modeling - Types
·         Logical
·         Physical
o    ER – Will be used in OLTPS and for Logical modeling
o    Dimensional Modeling – Used in data Warehousing
Hierarchy: Hierarchy is the path through which data can be sliced (selected) in multidimensional cubes or models. A hierarchy is an arrangement of columns. A hierarchy can link entities either directly or indirectly, and either vertically or horizontally
Type of Hierarchy:
·         Balanced hierarchy: Balanced hierarchy in dimensions contains consistent levels and depth between levels. Levels in this hierarchy maintain parent and child relationship like as shown below.



·         Unbalanced hierarchy: unbalanced hierarchy in dimensions contains consistent levels, but inconsistence depth between levels. Levels in this hierarchy maintain parent and child relationship like as shown below.



·         Ragged Hierarchy: Levels in this hierarchy cannot have consistence levels and depths.

Star Schema
In Star Schema Fact table sits in middle and surrounded by dimensions like a star.
A standard technique for designing the summary tables of a data warehouse. "Fact" tables each join to a larger number of independent "dimension “tables. The tables may be partially de-normalized for performance, but most queries will still need to join in one or more of the star tables.
Each dimension have primary key which is related to foreign key in the fact table
Star Schemas are in 2nd Normal Form. Most of the Business Intelligence applications recommend this schema
Advantage:
Since all dimensions are directly linked to fact table, less number of joins required and it fetches data rapidly and Less Complex query (due to fewer join), easy to understand
Disadvantage
Since redundant data is there, it is not easy to maintain
Snowflake Schema
The extension of the star schema. In this schema dimension tables are normalized. Dimension tables are broken into different tables.
For the example of store dimension in the star schema is stored along with address information. In Snowflake schema store table will be broken into 2 tables one is address table and another one store table
Tables are in 3 Normalized form.
Since this approach will increase the number of joins it leads to poor performance.
This schema can be used in small data ware house/data mart to save the space
Most of the Business Intelligence (BI) tools do not recommend this

Advantage:
·         Tables are normalized, it will save the space.
·         Easy to Maintain
Disadvantage
  • Poor query Performance, as it
  • Complex queries due to more joins, difficult to understand
  • Not suitable for Large Data ware houses/data marts
Star Schema vs Snowflake
Star Schema
Snowflake Schema
2nd Normal Form
3rd Normal form
Less number of Join
More Joins
Fast Query Performance
Slow Performance
Simple queries due to less joins, easy to understand
complex queries
Redundant Data
Less redundancy
Consume more space due to redundancy
Less Space
Good for large Data warehouse/datamart
Small DWH/Datamart
Most of Business Intelligence (BI) tool recommends
Not recommended






No comments:

Post a Comment