"The difficulty lies not so much in developing new ideas as in escaping from old ones."
John Maynard Keynes
The following survey provides a list of open source software (OSS) tools used in business intelligence (BI) and data warehousing systems. The tool selection criteria was based on the frequency and currency of the releases and on whether the product has released a stable build which could be used in a production environment. We only present those solutions which have had updates within the past two years. Our study looked at BI tools in the following categories: i) databases; ii) extract/transform/load (ETL); iii) master data management; iv) BI reporting tools; and v) data mining. In the case of an open source software bundle that overlaps categories, we divide the software bundle into its separate parts for ease of categorization.
In the following tables, we define "Fully Open Source" to mean software solutions that offer all of their source code to the public under an open source license. "Y" is used in this column to indicate the software falls into this category. Solutions which provide software versions containing additional features or functionality without offering that source code freely to the public are listed as "-" in this column. Any features mentioned in the tables refer only to the fully open source version of the software. Organizations that do not provide commercial support for their product, but who point their customers to third-party support solutions, receive a "Y*" in the Commercial Support column of the tables, to indicate that commercial support is available. Our analysis was based on a review of the OSS tools market as of July, 2009.
Database Management Systems
There are a wide variety of database management systems (DBMS) available for a data warehouse solution. Standard relational databases, such as MySQL and PostgreSQL, have dominated the open source database market for some time. More recently, new open source database systems have emerged that target the unique requirements of data warehousing. Column-oriented databases such as Infobright, LucidDB, and MonetDB increase data reading performance by storing data in columns rather than rows. This feature is important for a data warehouse where read-optimization takes precedence over write-optimization, and where typical queries look at the attributes of a column for a set of records rather than the components of a row.
There have been several developments in the area of distributed-computing such as cloud computing. The distributed database solutions presented below are built on top of the open source Java framework for distributed systems: Hadoop. Hadoop-based databases are best suited for data warehouses which number in the hundreds of millions of rows or greater. These distributed data warehouse systems have typically been used for web analytics of high-traffic websites. Finally, although there are several open source embeddable database solutions, none of them are scalable to the extent required by most data warehouse systems, and for that reason, were excluded from this study.
Table 1 summarizes the database types, commercial support availability, and last release version of open source database management systems.
Table 1: Open Source Database Management Systems
Extract, Transform, and Load Tools
ETL is the most time consuming component of the data warehouse development lifecycle. Typically for this reason alone, it is important to have a good tool which: i) allows the user to interact with as many different source systems as possible; ii) maintains an acceptable level of usability and performance; and iii) provides metadata regarding the transformations undertaken against the data. There are several open source ETL tools available today that provide good performance, good error handling, and metadata management. These tools are summarized in Table 2.
Table 2: Open Source ETL Tools
Master Data Management
The datasets in a BI environment are often sourced from many different source systems ranging from spreadsheets to enterprise resource planning (ERP) applications to third party data. The data may have the same key dimension data repeated in more than one source. Master data management (MDM) tools are used to identify duplicate records and consolidate them into one unified record which is then stored in the data warehouse.
The same customer could be viewed as a potential lead by the marketing department and as an existing client by the billing department. Potentially, this customer could receive service and marketing material to encourage the purchase of a service already being consumed. Data warehouse systems consolidate these two customer records into one through the use of MDM tools, ensuring that the customer information and status is consistent across the business system.
Table 3 summarizes the currently available open source MDM tools.
Table 3: Open Source MDM Tools
BI Reporting and Analytics Tools
BI reporting and analytics tools are what business users typically use to access data warehouse data, so it is imperative that the tool be intuitive and user-friendly. There are several open source solutions which deliver a variety of features targeting different groups of users such as report developers, decision makers, and non-specialized users. It is important to explore as many tools as possible before deciding on which software package is right for your organization. In Table 4, we focus on four common features found in BI reporting software: standard reporting, ad hoc reporting, OLAP reporting, and dashboards.
Standard reporting tools can address a variety of different business reporting requirements such as monthly sales by store or by product. Ad-hoc reporting allows non-technical business users to develop 'on-the-fly' reports without having any knowledge of structured query language (SQL) or the underlying database structure. OLAP reporting refers to online analytical processing, a technique specifically designed to improve performance of BI queries by pre-aggregating data and common analytical tasks, and by modeling the data multi-dimensionally. The benefit of having OLAP functionality is the ease with which users can drill-down and roll-up into different levels of data hierarchies such as by country, to by province, to by region, to by store. Dashboards are user-defined visual summaries of key business performance indicators, allowing the user to interact with the charts and graphs to drill-down into specific business areas.
Table 4 summarizes open source reporting and analytics tools, including which features are available for each solution.
Table 4: Open Source Reporting and Analytics Tools
Data Mining Tools
Data mining techniques are used in BI systems to determine patterns and relationships amongst data attributes and potential outcomes. Data warehouses are the most common source of data for mining BI because they are typically the most comprehensive and extensive data source available within the company, and are well suited for time-series analysis. Data mining has the potential of discovering correlations that are hidden within the sea of data. For example, by using data mining techniques, a company with sufficient customer data could determine which contacts, based on their attributes, such as age, location, and marital status, are more likely to respond to a direct mail marketing campaign. In turn, this could help the company reduce costs by targeting only those customers who are the most likely to respond to the advertisement.
Table 5 summarizes the currently available open source data mining tools.
Table 5: Open Source Data Mining Tools
Up until recently, the BI market has been dominated by large propietary software vendors. The maturation period for open source BI solutions has passed, and many of these companies and communities are now directly competing for market shares in several BI categories. BI consumers are now given a variety of enterprise-ready open source BI solutions to choose from to meet their business requirements, making their software decisions that much more compelling. If the present ubiquity of open source databases can be used as an indicator, then the future looks very bright for open source BI solution providers, and for BI users the world over.