Header Ads

Tutorial : What is Datawarehousing - Part I

Business Intelligence:

Business Intelligence symbolizes the tools and systems which are used for making critical decisions in a business. These systems allow to congregate and evaluate the data for strategic planning. They convert the raw data into meaningful and useful information. Business Intelligence relies on data warehousing to extract the required data. Thus, a data warehouse acts as a back-end for Business Intelligence.

What is a Data Warehouse?

Data Warehouse is a storage place for data. It is used to store current and historical information.

According to Ralph Kimball, "Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model."

According to Bill Inmon, "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process."

However, the definition by Bill Inmon is extensively accepted which is explained in detail below:

Subject-Oriented: Data Warehouse concentrates on a particular subject area.
For example, a data warehouse consisting the details of the sales.

Integrated: Data warehouse consists of data that has been combined from numerous sources. The data can be obtained from mainframes, flat files, SQL Server database and various other heterogeneous sources.

Time-variant: Data warehouse stores current and historical data. The data is used for decision-making and forecasting the business trends.

Non-volatile: The data stored in a data warehouse is retained and never deleted. A different version of data is stored in the data warehouse indicating if any inserts, updates or deletes have occurred. Hence, the original data is never altered.

Why do we need a data warehouse?

Data warehouses are used extensively in the largest and most complex businesses around the world. In demanding situations, good decision making becomes critical. Significant and relevant data is required to make decisions. This is possible only with the help of a well-designed data warehouse.

Enhancing the turnaround time for analysis and reporting: Data warehouse allows business users to access critical data from a single source enabling them to take quick decisions. They need not waste time retrieving data from multiple sources.

The business executives can query the data themselves with minimal or no support from IT which in turn saves money and time.

Improved Business Intelligence: Data warehouse helps in achieving the vision for the managers and business executives. Outcomes that affect the strategy and procedures of an organization will be based on reliable facts and supported with evidence and organizational data.

Benefit of historical data: Transactional data stores data on a day to day basis or for a very short period of duration without the inclusion of historical data. In comparison, a data warehouse stores large amounts of historical data which enables the business to include time-period analysis, trend analysis, and trend forecasts.

Standardization of data: The data from heterogeneous sources are available in a single format in a data warehouse. This simplifies the readability and accessibility of data.

For example, gender is denoted as Male/ Female in Source 1 and m/f in Source 2 but in a data warehouse the gender is stored in a format which is common across all the businesses i.e. M/F.

Immense ROI (Return On Investment): Return On Investment refers to the additional revenues or reduces expenses a business will be able to realize from any project. According to a 2002 International Data Corporation (IDC) study "The Financial Impact of Business Analytics", analytics projects have been achieving a substantial impact on a business’ financial status.



What are the components of a Data warehouse?

The components of a data warehouse are depicted in the figure below:
Tutorial : What is Datawarehousing - Part I | TekkiPedia
Datawarehousing Components (Image : WideSkills)
Data Sources:
A flat file database stores data in a normal text format. Contrary to a relational database where the data is stored in the form of tables, in a flat file database the data stored does not have a folders or paths related to them. No manipulations are performed on the data. Delimiters are used in flat files to separate the data columns.

Excel spreadsheets are regularly used in data warehousing operations. They are impressive, low-priced, and flexible tolls that many decision-makers find convenient to use. Excel also provides graphing features that allow the end-user to present the required data in chart and graph formats. These formats can be easily integrated into MS Word and Power Point presentations.

Operational systems of a business contain the day to day transactions of the data at a low-level. For example, the sales data, HR data, marketing data are used as input sources for a data warehouse.

Legacy systems are the applications of the yesteryear. They mirror the requirements of a business that might be twenty to twenty five year old. They are use till date since over years these systems have captured the business knowledge and rules that are exceptionally difficult to translate to a new platform/application.

Staging Area:
There are three important parts in this. Extraction, Transformation and Loading (ETL).

The first part of the staging area is the most challenging process of extraction. Depending on how accurately the data is extracted the subsequent operations succeed or fail. The source systems might be complicated or poorly documented due to which the process becomes all the more difficult. The data may be extracted not only once but also periodically when changes occur at the source side.

The second stage is the transformation where the data is converted from one format to another. Since data often exists in different locations and formats across the enterprises, data conversion is mandatory to ensure that data from one application is comprehensible to other applications and databases.


The third stage is the loading where the extracted and transformed data is loaded into a data mart or a data warehouse depending on the business. The populated data is used for presentation applications by the end users.


Data Repository:
The data is loaded into a data warehouse in the form of facts and dimensions.


Users:
The loaded data is accessed for reporting, analysis, and mining. The reporting tools like Business Objects and Cognos are used by users to generate reports. The data is also used for predicting trends.


What is the difference between OLTP and a Data Warehouse?

Tutorial : What is Datawarehousing - Part I | TekkiPedia
OLTP and OLAP (Image : http://datawarehouse4u.info)
OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is put on very fast query processing, maintaining data integrity in multi-access environments and an effectiveness measured by number of transactions per second. In OLTP database there is detailed and current data, and schema used to store transactional databases is the entity model (usually 3NF). 

OLAP (On-line Analytical Processing) is characterized by relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems a response time is an effectiveness measure. OLAP applications are widely used by Data Mining techniques. In OLAP database there is aggregated, historical data, stored in multi-dimensional schemas (usually star schema).


The following table summarizes the major differences between OLTP and OLAP system design.



­
OLTP System
Online Transaction Processing
(Operational System)
OLAP System
Online Analytical Processing
(Data Warehouse)
Source of data
Operational data; OLTPs are the original source of the data.
Consolidation data; OLAP data comes from the various OLTP
Databases
Purpose of data
To control and run fundamental business tasks
To help with planning, problem solving, and decision making
What data
Reveals a snapshot of ongoing business processes
Multi-dimensional views of various kinds of business activities
Inserts and Updates
Short and fast inserts and updates initiated by end users
Periodic long-running batch jobs refresh the data
Queries
Relatively standardized and simple queries Returning relatively
few records
Often complex queries involving aggregations
Processing Speed

Typically very fast
Depends on the amount of data involved; batch data refreshes and
complex queries may take many hours; query speed can be improved by creating
indexes
Space Requirements
Can be relatively small if historical data is archived
Larger due to the existence of aggregation structures and
history data; requires more indexes than OLTP
Database Design
Highly normalized with many tables
Typically de-normalized with fewer tables; use of star and/or
snowflake schemas
Backup and Recovery
Backup religiously; operational data is critical to run the
business, data loss is likely to entail significant monetary loss and legal
liability
Instead of regular backups, some environments may consider
simply reloading the OLTP data as a recovery method











Powered by Blogger.