
Extraction-Transformation-Loading Processes

Alkis Simitsis

National Technical University of Athens, Greece

Panos Vassiliadis

University of Ioannina, Greece

Timos Sellis

National Technical University of Athens, Greece

INTRODUCTION

A data warehouse (DW) is a collection of technologies aimed at enabling the knowledge worker (executive, manager, analyst, etc.) to make better and faster decisions. The architecture of a DW exhibits various layers of data, in which data of one layer are derived from data of the lower layer (see Figure 1). The operational databases, also called data sources, form the starting layer. They may consist of structured data stored in open database and legacy systems, or even in files. The central layer of the architecture is the global DW. The global DW keeps a historical record of data that result from the transformation, integration, and aggregation of detailed data found in the data sources. An auxiliary area of volatile data, the data staging area (DSA), is employed for the purpose of data transformation, reconciliation, and cleaning. The next layer of data involves client warehouses, which contain highly aggregated data directly derived from the global warehouse. There are various kinds of local warehouses, such as data marts or on-line analytical processing (OLAP) databases, which may use relational database systems or specific multidimensional data structures. The whole environment is described in terms of its components, metadata, and processes in a central metadata repository located at the DW site.

Figure 1. A data warehouse environment (sources feeding, through ETL and the DSA, a global DW, which in turn feeds data marts, OLAP tools, and reporting; a metadata repository describes the whole environment)

In order to facilitate and manage the DW operational processes, specialized tools are available in the market under the general title extraction-transformation-loading (ETL) tools. ETL tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a DW (see Figure 2). The functionality of these tools includes the following tasks; a minimal sketch of such a flow appears after the list:

the extraction of relevant information at the source side;

the transportation of this information to the DSA;

the transformation (i.e., customization and integration) of the information coming from multiple sources into a common format;

the cleaning of the resulting data set, on the basis of database and business rules; and

the propagation and loading of the data to the DW and the refreshment of data marts.
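To make these steps concrete, the following is a minimal sketch of an ETL flow in Python. The source file, table names, and cleaning rules are illustrative assumptions, not part of the original article, and a real deployment would rely on dedicated ETL tooling rather than hand-written scripts.

```python
import csv
import sqlite3

# Hypothetical source snapshot and target table, for illustration only.
SOURCE_FILE = "orders_source.csv"
dw = sqlite3.connect("warehouse.db")  # stands in for the global DW
dw.execute("CREATE TABLE IF NOT EXISTS dw_orders (order_id INTEGER, customer TEXT, amount_eur REAL)")

def extract(path):
    """Extraction: read the relevant records at the source side."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transformation and cleaning: common format, database and business rules."""
    for r in rows:
        if not r.get("order_id"):                 # cleaning rule: reject rows without a key
            continue
        amount = float(r["amount"])
        if r.get("currency", "EUR") == "USD":     # value reconciliation with an illustrative rate
            amount *= 0.9
        yield (int(r["order_id"]), r["customer"].strip().upper(), round(amount, 2))

def load(rows):
    """Loading: bulk insert into the DW table."""
    dw.executemany("INSERT INTO dw_orders VALUES (?, ?, ?)", rows)
    dw.commit()

load(transform(extract(SOURCE_FILE)))
```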

BACKGROUND

In the past, there have been research efforts towards the design and optimization of ETL tasks. We mention three research prototypes: (a) AJAX, (b) Potter’s Wheel, and (c) ARKTOS II. The first two prototypes are based on algebras, which we find mostly tailored for the case of homogenizing Web data; the latter concerns the modeling of ETL processes in a customizable and extensible manner.

Figure 2. ETL processes in detail (extraction and transportation from the sources to the DSA, transformation and cleaning, loading into the DW, and refreshment of the data marts)

The AJAX system (Galhardas, Florescu, Shasha, & Simon, 2000) deals with typical data quality problems, such as the object identity problem, errors due to mistyping, and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that starts from some input source data. AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering, and merging transformations. Finally, an interactive environment is supplied to the user to resolve errors and inconsistencies that cannot be automatically handled and to support a stepwise refinement design of data cleaning programs.

Raman and Hellerstein (2001) present the Potter’s Wheel system, which is targeted at providing interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows), and unfolding. Optimization algorithms are also provided to reduce CPU usage for certain classes of operators. The general idea behind Potter’s Wheel is that users build data transformations in an iterative and interactive way; users can thus gradually build transformations as discrepancies are found and clean the data without writing complex programs or enduring long delays.
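The split and fold operations mentioned above can be illustrated with a few lines of plain Python. This is only a sketch of the idea behind such row-and-column transformations; the data, function names, and signatures are invented for the example and are not Potter’s Wheel’s actual interface.

```python
import re

# Invented data: one record with a composite name and two quarterly measures.
rows = [{"name": "Smith, John", "q1": 10, "q2": 12}]

def split(rows, column, pattern, new_columns):
    """Split a column into several columns on the basis of a regular expression."""
    out = []
    for r in rows:
        parts = re.match(pattern, r[column]).groups()
        r = {k: v for k, v in r.items() if k != column}
        r.update(dict(zip(new_columns, parts)))
        out.append(r)
    return out

def fold(rows, columns, key_name, value_name):
    """Fold a set of columns: each folded attribute becomes a separate row."""
    out = []
    for r in rows:
        base = {k: v for k, v in r.items() if k not in columns}
        for c in columns:
            out.append({**base, key_name: c, value_name: r[c]})
    return out

step1 = split(rows, "name", r"(\w+), (\w+)", ["last", "first"])
step2 = fold(step1, ["q1", "q2"], "quarter", "sales")
# step2 == [{'last': 'Smith', 'first': 'John', 'quarter': 'q1', 'sales': 10},
#           {'last': 'Smith', 'first': 'John', 'quarter': 'q2', 'sales': 12}]
```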

ARKTOS II is a coherent framework for the conceptual, logical, and physical design of ETL processes. The goal of this line of research is to facilitate, manage, and optimize the design and implementation of ETL processes, both during the initial design and deployment stage and during the continuous evolution of the DW. To this end, Vassiliadis, Simitsis, and Skiadopoulos (2002) and Simitsis and Vassiliadis (2003) proposed a novel conceptual model. Further, Simitsis, Vassiliadis, Skiadopoulos, and Sellis (2003) and Vassiliadis et al. (2004) presented a novel logical model. The proposed models, conceptual and logical, were constructed in a customizable and extensible manner so that the designer can enrich them with his own recurring patterns for ETL processes. Therefore, ARKTOS II offers a palette of several templates, representing frequently used ETL transformations along with their semantics and their interconnection (see Figure 3). In this way, the construction of ETL scenarios as a flow of these transformations is facilitated. Additionally, ARKTOS II takes into account the optimization of ETL scenarios, with a main focus on improving the time performance of an ETL process, and it tackles the problem of how the software design of an ETL scenario can be improved without any impact on its consistency.
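The idea of building an ETL scenario as a flow of template transformations can be sketched as follows. The code is an illustration of the concept only; it is not ARKTOS II code, and the template names, parameters, and sample rows are assumptions made for the example.

```python
# Three toy templates, named after entries of the ARKTOS II palette (see Figure 3).
def not_null(rows, attr):                        # filter: NN
    return [r for r in rows if r.get(attr) is not None]

def surrogate_key(rows, attr, lookup):           # unary: SK
    return [{**r, attr: lookup[r[attr]]} for r in rows]

def projection(rows, attrs):                     # unary: projection
    return [{a: r[a] for a in attrs} for r in rows]

# A scenario is an ordered flow of parameterized template transformations.
scenario = [
    (not_null,      {"attr": "id"}),
    (surrogate_key, {"attr": "id", "lookup": {10: 100, 20: 110}}),
    (projection,    {"attrs": ["id", "descr"]}),
]

def run(rows, scenario):
    for template, params in scenario:
        rows = template(rows, **params)
    return rows

print(run([{"id": 10, "descr": "Coca"}, {"id": None, "descr": "bad"}], scenario))
# [{'id': 100, 'descr': 'Coca'}]
```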

An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in Jarke, Lenzerini, Vassiliou, and Vassiliadis (2000). Rundensteiner (1999) offered a discussion on various aspects of data transformations. Sarawagi (2000) offered a similar collection of papers in the field of data cleaning, including a survey (Rahm & Do, 2000) that provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions for specific problems (e.g., Borkar, Deshmuk, & Sarawagi, 2000; Monge, 2000). In a related but different context, the IBIS tool (Calì et al., 2003) is an integration tool following the global-as-view approach to answer queries in a mediated system. Moreover, there exists a variety of ETL tools in the market. Simitsis (2004) listed the ETL tools available at the time this article was written.

MAIN THRUST OF THE ARTICLE

In this section we briefly review the individual issues that arise in each of the phases of an ETL process, as well as the problems and constraints that concern it.

Figure 3. Typical template transformations provided by ARKTOS II

Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM)

Unary transformations: Push, Aggregation (γ), Projection (π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN)

Binary transformations: Union (U), Join (⋈), Diff (Δ), Update detection (ΔUPD)

Composite transformations: Slowly changing dimension, Types 1, 2, and 3 (SCD-1/2/3), Format mismatch (FM), Data type conversion (DTC), Switch (σ*), Extended union (U)

Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr)

File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort)


Global Problems and Constraints

Scalzo (2003) mentions that 90% of the problems in a DW arise from the nightly batch cycles that load the data. During this period, administrators have to deal with problems such as (a) efficient data loading and (b) concurrent job mixture and dependencies. Moreover, ETL processes have global time constraints, including the time they must be initiated and their completion deadlines. In fact, in most cases, there is a tight time window during the night that can be exploited for the refreshment of the DW, because the source system is off-line or not heavily used during this period.

Consequently, a major problem arises with the scheduling of the overall process. The administrator must find the right execution order for dependent jobs and job sets on the existing hardware within the permitted time schedule. On the other hand, if the OLTP applications cannot produce the necessary source data in time for processing before the DW comes online, the information in the DW will be out of date. Still, because DWs are used for strategic purposes, this problem can usually be tolerated, since long-term reporting and planning are not severely affected by this type of failure.
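Finding an execution order for dependent jobs can be illustrated with a topological sort over a dependency graph. The job names and dependencies below are invented for the example; a production scheduler would also account for the time window, available resources, and failure handling.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical nightly jobs, mapping each job to the set of jobs it depends on.
jobs = {
    "extract_orders":    set(),
    "extract_customers": set(),
    "load_dw":           {"extract_orders", "extract_customers"},
    "refresh_marts":     {"load_dw"},
}

# One admissible execution order for the nightly window.
print(list(TopologicalSorter(jobs).static_order()))
# e.g. ['extract_orders', 'extract_customers', 'load_dw', 'refresh_marts']
```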

Extraction and Transportation

During the ETL process, one of the first tasks that must be performed is the extraction of the relevant information that must be further propagated to the warehouse. To minimize the overall processing time, only the fraction of the source data that has changed since the previous execution of the ETL process is extracted, mainly the newly inserted and possibly updated records. Usually, change detection is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). Efficient algorithms exist for this task, such as the snapshot differential algorithms presented by Labio and Garcia-Molina (1996). Another technique is log sniffing (i.e., scanning the log file to reconstruct the changes performed since the last scan). In rare cases, change detection can be facilitated by the use of triggers. However, this solution is technically impossible for many of the sources that consist of legacy systems or plain flat files. In numerous other cases, where relational systems are used at the source side, the usage of triggers is also prohibitive due to the performance degradation that their usage incurs and the need to intervene in the structure of the database. Moreover, another crucial issue concerns the transportation of data after the extraction, where tasks such as FTP, encryption–decryption, compression–decompression, and so forth, can take place.
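The idea of snapshot comparison can be sketched with a simple key-based diff. This is only an illustration with invented data; it is not the snapshot differential algorithm of Labio and Garcia-Molina (1996), which is designed to work efficiently on very large, disk-resident snapshots.

```python
# Two snapshots of a source table, keyed on a record identifier (invented data).
previous = {1: ("Coca", 5.0), 2: ("Pepsi", 4.0)}
current  = {1: ("Coca", 5.5), 3: ("HBH", 3.0)}

inserted = {k: current[k] for k in current.keys() - previous.keys()}
deleted  = {k: previous[k] for k in previous.keys() - current.keys()}
updated  = {k: current[k] for k in current.keys() & previous.keys()
            if current[k] != previous[k]}

print(inserted)  # {3: ('HBH', 3.0)}   -> newly inserted records
print(updated)   # {1: ('Coca', 5.5)}  -> changed records to be propagated
print(deleted)   # {2: ('Pepsi', 4.0)} -> records no longer present at the source
```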


Transformation and Cleaning

It is possible to determine typical tasks that take place during the transformation and cleaning phase of an ETL process. Rahm and Do (2000) further detail this phase in the following tasks: (a) data analysis, (b) definition of transformation workflow and mapping rules, (c) verification, (d) transformation, and (e) backflow of cleaned data.

In terms of the transformation tasks, we can categorize the problems into two main classes (Lenzerini, 2002):

conflicts and problems at the schema level, and

data-level transformations (i.e., at the instance level).

The main problems at the schema level are (a) naming conflicts, where the same name is used for different objects (homonyms) or different names are used for the same object (synonyms), and (b) structural conflicts, where one must deal with different representations of the same object in different sources.

In addition, there are many variations of data-level conflicts across sources: duplicated or contradicting records, different value representations (e.g., for marital status), different interpretations of the values (e.g., measurement units, dollar vs. euro), different aggregation levels (e.g., sales per product vs. sales per product group), or reference to different points in time (e.g., current sales as of yesterday for source 1 vs. current sales as of last week for source 2). The list is enriched by low-level technical problems such as data type conversions, applying format masks, assigning fields to a sequence number, substituting constants, setting values to NULL or DEFAULT based on a condition, or using simple SQL operators (e.g., UPPER, TRUNC, SUBSTR).
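A few of these low-level, data-level tasks are sketched below on a single invented record; the field names, mapping table, and conversion rate are assumptions made purely for illustration.

```python
# Invented raw record from a source system.
raw = {"name": " smith ", "status": "M", "price": "19,90", "currency": "USD"}

STATUS_MAP = {"M": "married", "S": "single"}   # reconciling different value representations
USD_TO_EUR = 0.9                               # illustrative rate for unit reconciliation

clean = {
    "name": raw["name"].strip().upper(),            # trimming and UPPER (format handling)
    "status": STATUS_MAP.get(raw["status"]),        # falls back to None (NULL) when unmapped
    "price_eur": round(
        float(raw["price"].replace(",", "."))       # data type conversion with a format fix
        * (USD_TO_EUR if raw["currency"] == "USD" else 1.0),
        2,
    ),
}
print(clean)  # {'name': 'SMITH', 'status': 'married', 'price_eur': 17.91}
```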

In the sequel, we present three common ETL transformations as examples: (a) semantic reconciliation and denormalization, (b) surrogate key assignment, and (c) string problems.

Semantic reconciliation and denormalization: Frequently, DWs are highly denormalized in order to answer certain queries more quickly. For example, in Figure 4, one can observe that a query on the total income of an employee in table DW.EMP can easily be computed as an addition of the attributes Salary, Tax, and Bonus, whereas in the schema of the OLTP table EMP_INCOME, we should apply an aggregation operation. The transformation of information organized in rows into information organized in columns is called rotation or denormalization because, frequently, the derived values (e.g., the total income) are also stored as columns, functionally dependent on other attributes. Occasionally, it is possible to apply the reverse transformation to normalize denormalized data before they are loaded into the DW. Observe also how the different tables at the source side (i.e., EMP, EMP_INCOME, Income_Lookup) are integrated into a single DW table (i.e., DW.EMP).

Figure 4. Semantic reconciliation and denormalization (source tables EMP, EMP_INCOME, and Income_Lookup integrated into the denormalized table DW.EMP)

Figure 5. Surrogate key assignment (source tables R1 and R2 mapped through a lookup table Lookup(SourceID, Source, SurrogateKey) to the reconciled table DW.R)
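The rotation described above can be sketched as a simple pivot of income rows into the columns of one record. The table layout, attribute names, and amounts are illustrative assumptions based on the description of Figure 4, not the figure itself.

```python
# Income organized in rows at the source (one row per income kind), invented values.
emp_income = [
    {"emp_id": 1, "kind": "Salary", "amount": 30000},
    {"emp_id": 1, "kind": "Tax",    "amount": -6000},
    {"emp_id": 1, "kind": "Bonus",  "amount": 2000},
]

dw_emp = {}
for row in emp_income:
    rec = dw_emp.setdefault(row["emp_id"], {"emp_id": row["emp_id"]})
    rec[row["kind"]] = row["amount"]        # each income kind becomes a column

for rec in dw_emp.values():
    # The derived value is stored as a column, functionally dependent on the others.
    rec["TotalIncome"] = rec["Salary"] + rec["Tax"] + rec["Bonus"]

print(list(dw_emp.values()))
# [{'emp_id': 1, 'Salary': 30000, 'Tax': -6000, 'Bonus': 2000, 'TotalIncome': 26000}]
```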

Surrogate keys: In a DW project, we usually replace the keys of the production systems with a uniform key, which we call a surrogate key (Kimball, Reeves, Ross, & Thornthwaite, 1998). The reasons for this replacement are performance and semantic homogeneity. Performance is affected because textual attributes are not the best candidates for indexed keys and thus need to be replaced by integer keys. More important, semantic homogeneity causes reconciliation problems, because different production systems might use different keys for the same object (synonyms) or the same key for different objects (homonyms), resulting in the need for a global replacement of those values in the DW. Observe row 10, Pepsi, in table R2 of Figure 5. This row has a synonym conflict with row 20, Pepsi, in table R1, because they both represent the same real-world entity with different IDs, and it has a homonym conflict with row 10, Coca, in table R1 (over attribute ID). The production key ID is replaced by a surrogate key through a lookup table of the form Lookup(SourceID, Source, SurrogateKey). The Source column of this table allows the same key value to be used in different sources for different objects (e.g., value 10 in tables R1 and R2), because each occurrence is mapped to its own object in the DW. At the end of this process, the DW table DW.R has globally unique, reconciled keys.
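A lookup-based surrogate key assignment can be sketched as follows, using the R1/R2/DW.R values described above as illustrative data; the code itself is an assumed, simplified rendering of the mechanism, not a quoted implementation.

```python
# Lookup(SourceID, Source) -> SurrogateKey, populated with the Figure 5 values.
lookup = {(10, "R1"): 100, (20, "R1"): 110, (10, "R2"): 110, (20, "R2"): 120}

r1 = [{"ID": 10, "Descr": "Coca"},  {"ID": 20, "Descr": "Pepsi"}]
r2 = [{"ID": 10, "Descr": "Pepsi"}, {"ID": 20, "Descr": "HBH"}]

def assign_surrogate(rows, source):
    """Replace the production key ID with the globally unique surrogate key."""
    return [{"ID": lookup[(r["ID"], source)], "Descr": r["Descr"]} for r in rows]

dw_r = {}
for r in assign_surrogate(r1, "R1") + assign_surrogate(r2, "R2"):
    dw_r[r["ID"]] = r   # records describing the same entity now collapse onto one key

print(sorted(dw_r.values(), key=lambda r: r["ID"]))
# [{'ID': 100, 'Descr': 'Coca'}, {'ID': 110, 'Descr': 'Pepsi'}, {'ID': 120, 'Descr': 'HBH'}]
```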

String problems: A major challenge in ETL processes is the cleaning and homogenization of string data, that is, data representing addresses, acronyms, names, and so forth. Usually, approaches to this problem involve the application of regular expressions (e.g., using Perl programs) for the normalization of string data to a set of reference values. Figure 6 illustrates an example of this problem.

Figure 6. String problems
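A regular-expression-based normalization of address strings might look like the following sketch; the reference patterns and sample values are invented for the example.

```python
import re

# Illustrative reference patterns mapping common variants to canonical tokens.
REFERENCE = [
    (re.compile(r"\bstr(eet)?\b\.?", re.IGNORECASE), "Street"),
    (re.compile(r"\bave?\b\.?", re.IGNORECASE), "Avenue"),
]

def normalize(value):
    value = " ".join(value.split()).title()   # collapse whitespace, fix capitalization
    for pattern, replacement in REFERENCE:
        value = pattern.sub(replacement, value)
    return value

print(normalize("24  main str."))  # '24 Main Street'
print(normalize("5 fifth ave"))    # '5 Fifth Avenue'
```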

Loading

The final loading of the DW has its own technical challenges. Simple SQL commands are not sufficient, because the open-loop-fetch technique, in which records are inserted one by one, is extremely slow for the vast volume of data to be loaded in the warehouse. An extra problem is the simultaneous usage of the rollback segments and log files during the loading process. Turning them off might involve some risk in the case of a loading failure. So far, the best technique is the usage of the batch loading tools offered by most RDBMSs, which avoid these problems.

Another problem is discriminating between new and existing data at loading time. This problem arises when a set of records has to be classified into (a) the new rows that need to be appended to the warehouse and (b) rows that already exist in the DW but whose values have changed and must be updated (e.g., with an UPDATE command). Modern ETL tools already provide mechanisms for this problem, mostly through language predicates.
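The classification of an incoming batch into appends and updates can be sketched as follows. The key names and records are invented for the example; in practice this step would be handled by the loader's own predicates or by a bulk MERGE-style operation in the target RDBMS.

```python
# Existing warehouse rows keyed by primary key, and an incoming batch (invented data).
existing = {100: {"ID": 100, "Descr": "Coca"}, 110: {"ID": 110, "Descr": "Pepsi"}}
incoming = [{"ID": 110, "Descr": "Pepsi Max"}, {"ID": 120, "Descr": "HBH"}]

to_insert = [r for r in incoming if r["ID"] not in existing]
to_update = [r for r in incoming
             if r["ID"] in existing and existing[r["ID"]] != r]

print(to_insert)  # [{'ID': 120, 'Descr': 'HBH'}]        -> appended to the warehouse
print(to_update)  # [{'ID': 110, 'Descr': 'Pepsi Max'}]  -> applied as UPDATEs
```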

Techniques that facilitate the loading task involve the simultaneous creation of tables and their respective indexes, the minimization of interprocess wait states, and the maximization of concurrent CPU usage.

FUTURE TRENDS

In a recent study, Giga Information Group (2002) reported that the ETL market reached a size of $667 million for year 2001; still, the growth rate reached a rather low 11% (compared with a 60% growth rate for year 2000). The study also proposed future technological challenges and forecasts that involve the integration of ETL with (a) XML adapters; (b) EAI (Enterprise Application Integration) tools (e.g., MQ-Series); (c) customized data quality tools; and (d) the move towards parallel processing of the ETL workflows.

There are several issues that are technologically open and that present interesting topics of research for the future. Active ETL, that is, the need to refresh the warehouse with the freshest data possible (ideally, online), is a hot topic that has recently appeared (Adzic & Fiore, 2003). The need for optimal algorithms, both for the individual transformations and for the whole process, is also an interesting topic of research. Finally, we anticipate that the extension of ETL mechanisms to nontraditional data, such as XML/HTML, spatial, and biomedical data, will also be a hot topic of research.

CONCLUSION

ETL tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a DW. In all the phases of an ETL process (extraction and transportation, transformation and cleaning, and loading), individual issues arise and, along with the problems and constraints that concern the overall ETL process, make its lifecycle very troublesome. Although the state of the art in both research and commercial ETL tools includes some remarkable progress, a lot of work remains to be done before we can claim that this problem is resolved. To this end, recent studies consider this subject a research challenge and pinpoint the main topics of future work.


REFERENCES

Adzic, J., & Fiore, V. (2003). Data warehouse population platform. Proceedings of 5th International Workshop on the Design and Management of Data Warehouses (DMDW ’03), Berlin, Germany.

Borkar, V., Deshmuk, K., & Sarawagi, S. (2000). Automatically extracting structure from free text addresses. Bulletin of the Technical Committee on Data Engineering, 23(4).

Calì, A., Calvanese, D., De Giacomo, G., Lenzerini, M., Naggar, P., & Vernacotola, F. (2003). IBIS: Semantic data integration at work. Proceedings of the 15th CAiSE: Vol. 2681. Lecture Notes in Computer Science (pp. 79-94). Springer.

Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000). Ajax: An extensible data cleaning tool. Proceedings of the ACM SIGMOD international conference on the management of data (p. 590). Dallas, TX.

Giga Information Group. (2002). Market overview update: ETL (Tech. Rep. No. RPA-032002-00021).

Inmon, W.-H. (1996). Building the data warehouse (2nd ed.). New York: John Wiley & Sons.

Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (Eds.). (2000). Fundamentals of data warehouses. Springer-Verlag.

Kimball, R., Reeves, L., Ross, M., & Thornthwaite, W. (1998). The data warehouse lifecycle toolkit: Expert methods for designing, developing, and deploying data warehouses. John Wiley & Sons.

Labio, W., & Garcia-Molina, H. (1996). Efficient snapshot differential algorithms for data warehousing. Proceedings of the 22nd international conference on very large data bases (VLDB) (pp. 63-74). Bombay, India.

Lenzerini, M. (2002). Data integration: A theoretical perspective. Proceedings of the 21st symposium on principles of database systems (PODS) (pp. 233-246). Wisconsin.

Monge, A. (2000). Matching algorithms within a duplicate detection system. Bulletin of the Technical Committee on Data Engineering, 23(4).

Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. Bulletin of the Technical Committee on Data Engineering, 23(4).

Raman, V., & Hellerstein, J. (2001). Potter’s Wheel: An interactive data cleaning system. Proceedings of the 27th international conference on very large data bases (VLDB) (pp. 381-390). Rome, Italy.


Rundensteiner, E. (Ed.). (1999). Data transformations [Special issue]. Bulletin of the Technical Committee on Data Engineering, 22(1).

Sarawagi, S. (2000). Data cleaning [Special issue]. Bulletin of the Technical Committee on Data Engineering, 23(4).

Scalzo, B. (2003). Oracle DBA guide to data warehousing and star schemas. Prentice Hall.

Simitsis, A. (2004). List of ETL tools. Retrieved May 10, 2004, from http://www.dbnet.ece.ntua.gr/~asimi/ETLTools.htm

Simitsis, A., Vassiliadis, P., Skiadopoulos, S., & Sellis, T. (2003). Modeling of ETL processes using graphs. Proceedings of the 2nd Hellenic Data Management Symposium (HDMS03), Athens, Greece.

Simitsis, A., & Vassiliadis, P. (2003). A methodology for the conceptual modeling of ETL processes. Proceedings of the decision systems engineering (DSE ’03), Velden, Austria.

Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M., & Skiadopoulos, S. (2004). A generic and customizable framework for the design of ETL scenarios. Information Systems Journal.

Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002). Conceptual modeling for ETL processes. Proceedings of the 5th international workshop on data warehousing and OLAP (DOLAP ’02), McLean, VA.

KEY TERMS


Data Mart: A logical subset of the complete data warehouse. We often view the data mart as the restriction of the data warehouse to a single business process or to a group of related business processes targeted towards a particular business group.

Data Staging Area (DSA): An auxiliary area of volatile data employed for the purpose of data transformation, reconciliation, and cleaning before the final loading of the data warehouse.

Data Warehouse (DW): A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data used to support the strategic decision-making processes for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data (Inmon, 1996).

ETL: Extract, transform, and load (ETL) refers to the data warehousing functions that involve extracting data from outside sources, transforming it to fit business needs, and ultimately loading it into the data warehouse. ETL is an important part of data warehousing; it is the way data actually gets loaded into the warehouse.

On-Line Analytical Processing (OLAP): The general activity of querying and presenting text and number data from data warehouses as well as a specifically dimensional style of querying and presenting that is exemplified by a number of OLAP vendors.

Source System: An operational system of record whose function is to capture the transactions of the business. A source system is often called a legacy system in a mainframe environment.

Target System: The physical machine on which the data warehouse is organized and stored for direct querying by end users, report writers, and other applications.


Free Software and Open Source Databases

Hugo J. Curti

Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina

INTRODUCTION

Free software and open source databases have become a serious alternative to commercial databases. This article states common points and differences between the open source and free software movements and briefly describes the most relevant licenses and database projects.

The biggest free software project in the world is the GNU¹ project, whose main sponsor is the FREE SOFTWARE FOUNDATION. There are other important free software sponsors around the world (Association for Free Software, 2004; Free Software Foundation Europe, 2004; Ray, Hudson, & Greve, 2004). In summary, free software is a broad paradigm with technical, philosophical, moral, and social underpinnings. There is a lot of information about free software provided by the Free Software Foundation (2004a).

BACKGROUND: FREE SOFTWARE AND OPEN SOURCE

Free Software

There are many ambiguities involved in the definition of free software. The word “free” may be used to say “free of charge” or to say “free speech.” It is the same word, but the meaning is different. When it is used in “Free Software,” free means freedom. Freedom to run a program. Freedom to study how a program works and adapt it to specific needs. Freedom to redistribute a program (even charging money for that). Freedom to improve a program and redistribute the improvements. A pure free software distribution license should grant all those rights to users and should also forbid any project using the software to revoke those rights. When the license states that “every project that uses this software in any form should be released as free software,” it is said to be a “copylefted” license. There are moral reasons to argue that all free software should be copylefted, although many weakly copylefted or even non-copylefted free software licenses exist.

There are also many technical advantages of free software over closed or proprietary software. As many developers can study the sources, most bugs are discovered at the beginning of free software’s life cycle, tending to stabilize it faster. Free software projects are generally based on relatively simple modules that are combined to form a complex project, facilitating in this way software reuse and drastically reducing the programming effort. In other words, problems that are solved synergistically in the free software world must be solved by means of man-hours and money in the proprietary software world.

Open Source Initiative

The Open Source Initiative is a marketing program for free software (Debian Project, 2004; Open Source Initiative, 2004b; Perens, 1997).

It is based on more pragmatic reasons than free software: It emphasizes the technical benefits of providing the source code (full peer review and rapid evolution) over the ideological reasons. It was conceived as a way to let free software reach the corporate world. The Open Source Initiative provides a certification mark that may be used in any software distributed under the terms of an approved license.

Free/Open Source Licenses

Only the most relevant licenses are listed here. The full list of Open Source licenses is published by the Open Source Initiative (2004a), and the full list of free software licenses is published by the Free Software Foundation (2004b).

The GNU General Public License (GPL)

Allows verbatim distribution of the source code.

Allows distribution of binaries as long as the source code used to build it is available somehow.

Allows modifications of the software to generate a new project, provided that the differences are clearly stated.

Forbids the imposition of further restrictions in the distribution license of derived projects (“copyleft”).


Provides no warranties of any kind, including the implied warranties of merchantability and fitness for a particular purpose.

The New BSD License and the X11 (MIT) License

These are very simple licenses. Everything is permitted except to sue the authors of the code (Wheeler, 2004). There is also an old version of the BSD license, which contains a clause requiring advertising material to acknowledge the University of California, Berkeley. This clause does not exist in the new BSD license.

The Mozilla Public License

The Mozilla Public License is a weakly copylefted license with some complex restrictions that make it incompatible with the GNU GPL. That is, a module covered by the GPL and a module covered by the MPL cannot legally be linked together (Free Software Foundation, 2004b).

FREE/OPEN SOURCE DATABASE PROJECTS

PostgreSQL

POSTGRESQL (2004) is an object-relational database management system (ORDBMS). It is based on a research project at the University of California called POSTGRES. POSTGRES pioneered many concepts that only became available in some commercial database systems much later. Because of its extensibility, POSTGRESQL is also very well suited for database research.

Features

Table inheritance.

Complex queries support.

Foreign keys and referential integrity support.

Transactional integrity.

Multiversion concurrency control.

New procedural languages for stored procedures and triggers may be added by the user.

New data types, functions, operators, aggregate functions, and index methods may be added by the user.

Support for updatable views and query rewrite rules.

Replication servers are under development (erServer, 2004; Slony I, 2004).

License

POSTGRESQL is released under the terms of the BSD public license.

MySQL

The MySQL software (2004) delivers a very fast, multithreaded, multiuser, and robust SQL (Structured Query Language) database server. MySQL Server is intended for mission-critical, heavy-load production systems as well as for embedding into mass-deployed software.

Features

Written in C and C++.

Uses GNU Automake, Autoconf, and Libtool (the standard open source tools), enhancing the portability of the code.

Fully multi-threaded.

Transactional and non-transactional storage engine.

The server is available as a separate program for network client/server operation or as a library for embedded stand-alone operation.

APIs for C, C++, Eiffel, Java, Perl, PHP, Python, Ruby, and Tcl are available.

Partial support for stored procedures and foreign keys.

License

MYSQL has a dual license approach. Users may choose to use it under the terms of the GNU GPL or purchase a commercial license from MYSQL AB. The reference manual is not covered under the GNU GPL (an authorization from MYSQL AB is required to sell printed copies).

Firebird

Firebird is an open source relational database that evolved from the commercial database INTERBASE by INPRISE, now known as BORLAND SOFTWARE CORP. (Firebird, 2004). In August 2000, this company released a beta version of Interbase 6.0 as open source.


After that, a core of developers formed the FIREBIRD project based on it. Now the FIREBIRD database is almost 100% compatible with INTERBASE, although both are independent.

Features

Most ANSI SQL 92 features implemented.

Runs on Windows, Linux, and a variety of Unix platforms.

“Classic” architecture for single-threaded environments and “super server” for multi-threaded environments.

Multiversion concurrency control.

Various SQL dialects for backwards compatibility reasons.

Native language support for stored procedures and triggers.

Updatable views.

Integrity check constraints.

License

Firebird is released under the terms of the INTERBASE PUBLIC LICENSE, which is based on the MOZILLA PUBLIC LICENSE.

FUTURE TRENDS AND CONCLUSION

Free software and open source (FS/OS) databases are becoming serious alternatives to big and expensive commercial databases. Many features present in commercial databases today were developed in FS/OS databases. As FS/OS projects evolve, databases get more complex and more complete.

Free software and open source are two definitions of a new paradigm for software development. There is a big gap between the corporate world and this paradigm, but the gap seems to have been getting smaller in recent years (Yoder, 2002). In the author’s opinion, chances are that in the near future this new paradigm will generate new business models and will replace the old ones.


REFERENCES

Association for Free Software. (2004). AFFS FAQs. Retrieved from http://www.affs.org.uk/faq.html

Debian Project. (2004, April). The Debian social contract, Version 1.0. Retrieved from http://www.debian.org/social_contract.html

erServer. (2004). erServer main site. http://www.erserver.com

Firebird. (2004, October). Firebird project’s main site. http://firebird.sourceforge.net

Free Software Foundation. (2004a, October). The GNU operating system. Retrieved from http://www.gnu.org

Free Software Foundation. (2004b, October). Various licenses and comments about them. Retrieved from http://www.gnu.org/licenses/license-list.html

Free Software Foundation Europe. (2004, October). Free Software Foundation Europe main site. http://www.fsfeurope.org

MySQL. (2004, October). MySQL developer’s zone. http://www.mysql.org

Open Source Initiative. (2004a). The approved licenses. Retrieved from http://www.opensource.org/licenses

Open Source Initiative. (2004b). The Open Source Definition. Retrieved from http://www.opensource.org/docs/definition.php

Perens, B. (1997, June). The Debian free software guidelines (DFSG). Retrieved from http://www.debian.org/social_contract.html#guidelines

PostgreSQL. (2004, October). PostgreSQL project’s main site. http://www.postgresql.org

Ray, M. J., Hudson, A., & Greve, G. (2004). Constitution for AFFS. Retrieved from http://www.affs.org.uk/documents/affs-const.txt

Slony I. (2004). The Slony I project main site. http://www.slony.info

Wheeler, D. A. (2004, September). Why open source software / free software (OSS/FS)? Look at the numbers! Retrieved from http://www.dwheeler.com/oss_fs_why.html

Yoder, M. (2002, October). Our open source / free software future: It’s just a matter of time. Retrieved from http://yoderdev.com/oss-future.html


KEY TERMS

Closed (or Proprietary) Software: Software that is owned by an individual or a company (usually the one that develops it) and is protected from unauthorized use by patents, trademarks, or copyrights. Such software is often sold or leased to other individuals or organizations, usually with strict restrictions regarding its use, modification, and further distribution. In many cases the source code is kept secret.

Free Software Foundation (FSF): Nonprofit organization founded in 1985 to support the GNU project and to promote education about free software.

GNU Project: Project launched by Richard Stallman with the goal of creating a complete free operating system: the GNU system. GNU is a recursive acronym for “GNU’s Not Unix.”

Multi-Threaded System: System where many concurrent sequences of executing instructions (called threads) may coexist and share data.

Multiversion Concurrency Control: Method used in relational databases to achieve serializability of transactions. MCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write time stamp, and a transaction reads the most recent version of an object that precedes the transaction’s time stamp.
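A toy illustration of this version-selection rule follows; it is only a sketch of the rule as stated above, not how any particular database engine implements multiversion concurrency control.

```python
# Each object keeps its committed versions as (write_timestamp, value), oldest first.
versions = {
    "row42": [(5, "v1"), (12, "v2"), (20, "v3")],
}

def read(obj, transaction_ts):
    """Return the most recent version whose write time stamp precedes the reader."""
    visible = [value for wts, value in versions[obj] if wts < transaction_ts]
    return visible[-1] if visible else None

print(read("row42", 15))  # 'v2' -> written at 12; the later version at 20 is not visible
```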

Open Source Initiative (OSI): An organization dedicated to managing and promoting the Open Source Definition for the good of the community.

Software License: Contract between a producer and a user of computer software, sometimes called an end-user license agreement (EULA), that specifies the parameters of the permission granted by the owner to the user.

ENDNOTE

1. Recursive acronym meaning GNU is Not Unix.

 
