"By June 2006, the project has hit the magic "100 cases finished" mark, at an exciting equal "100% legal success" mark. Every GPL infringement that we started to enforce was resolved in a legal success, either in-court or out of court."
At many points in the life of a software enterprise, determination of intellectual property (IP) cleanliness becomes critical. The value of an enterprise that develops and sells software may depend on how clean the software is from the IP perspective. This article examines various methods of ensuring software IP cleanliness and discusses some of the benefits and shortcomings of current solutions.
Source of the Problem
In any software-producing organization, software projects conceptually comprise iteratively decomposing a project into smaller and more manageable sub-projects until that point where individual sub-projects can be assigned to individuals or groups. In such a scenario, an individual software developer assigned to a software sub-project has a combination of options for sourcing the necessary software, including: (i) obtaining software from their organization's code repository, (ii) obtaining software from one or more open source repositories, (iii) obtaining software through purchase, and (iv) obtaining software through the creative process of writing it.
Depending upon the size and complexity of a software project, this scenario may be repeated many times. Also, when a sub-project is outsourced or otherwise assigned to a separate organization such as in a collaborative project, the outsourcing team goes through a similar scenario. In sum, the software employed by any typical software-producing organization may be derived from such sources as: (i) the organization's code-base; (ii) open source repositories; (iii) commercial vendors; and (iv) in-house development.
With this scenario in mind, IP integrity of the final product is very much a function of: i) individual pieces and ii) the IP practices that were defined, monitored and enforced during the development process.
We observe that the IP integrity of the software produced can be compromised in ways that include:
- The organization's code repository may have impure artifacts that are introduced into the product
- If used, the outsourcer's repository or the collaborative organization's repository may have impure artifacts that are submitted
- Open source components may not satisfy the organization's open source policy, assuming that the organization has such a policy and has communicated it appropriately
- Open source components introduced by outsourcers or collaborators may not satisfy the open source policy or may be improperly checked and verified against existing policies
- Certain commercial code may be improperly licensed for the application, geographic market, or mode of deployment
In addition, a developer may contribute code whose copyright is held by another firm. For example, a recent survey indicates that about 70% of developers carry code from one company to another. Anecdotal experience suggests that the real percentage may be higher.
The situation gets more complex if we consider that code repositories can become contaminated because of "generational tainting" properties of licenses such as the GPL, or more generally, the pedigree of open source code may be unknown or difficult to categorically establish.
In this article, we do not address intentional code contamination and instead focus on unintentional contamination as we believe that most code contaminations are unintentional. For any software project of sufficient size it is generally difficult to understand exactly what is in the software by the time the software is collected, integrated, tested and released. It may be very costly and time consuming to perform the necessary due diligence to identify what problems may exist.
An immediate conclusion we can draw from the preceding is the importance of employing safe software development practices. That is, we advocate a preventive approach aided by policies, education and tools.
Who is Affected
Understanding the IP pedigree of software is important and is becoming an increasingly common requirement in many enterprises that create and/or use software. In what follows, we first describe the software food chain, and then examine how it can impact the players.
The software industry chain may be described in simple terms. The chain consists of increasingly larger and more complex organizations that consume software by bringing in software from other (typically smaller) firm(s), combining the software with their own value-added functionality, and passing the result on to the next (typically larger) firm in the chain.
To show the scope of the firms that can be affected by contamination, let's use an example. The software chain in the cellular phone industry consists of:
- Small developers and independent software vendors that contribute hardware drivers or protocol stacks to the chip makers
- Chip makers that add their own content and pass it on to the cell phone vendor
- The cell phone vendor adds applications and graphical user interfaces, again obtained from other players and internal development teams, and passes these on to the operator
- The operator may add its own customized content such as splash screens or operator-specific applications and pass these on to the end user
In this example, there could be twenty or more players involved. An IP contamination anywhere in the food chain would affect many players.
Attention on IP purity is heightened generally when there is a transaction involved. The nature of the transaction could be:
- Investment in an organization that creates or consumes software
- An M&A (merger and acquisition) activity involving a player in the above food chain
- Public Offering by a player in the chain
- A software transfer from one player to the next in the chain, such as a contracting or software licensing event
Such transactions contribute to the creation of a whole industry centered on verifying software cleanliness through due diligence. Another contributor is the clauses on representations and warranties or IP indemnity in legal documents supporting such transactions. It is increasingly common to encounter IP lawyers and venture capitalists (VCs) with stories of transactions that have been delayed or lost (i) due to the time that it takes to verify cleanliness, or (ii) due to ambiguities around IP ownership. All of this argues for adopting and deploying safe software development practices.
IP Contamination is Prevalent
The scope of IP contamination is expanding, and is better understood when we examine its contributing factors and the momentum behind them.
The use of open source software (OSS) is increasing dramatically, shifting the models and metrics behind software development. Software re-use, code visibility, efficient development intervals, costs, and enhanced functionalities are some of the positive attributes driving the increasing use of open source. However, very specific licenses regulate the use of OSS and the terms of many of these licenses differ and not uncommonly require expert interpretation.
Aside from OSS, other growth areas for contamination include:
- Designer previous-life contamination, which we described earlier, is common
- Outsourcing, on-shore or off-shore, is expanding, with the additional danger of cross-project contamination
- E-bidding for software, as on Elance, is growing
- Collaborative development is becoming common with universities, governments and industries as players
In sum, we reiterate our belief that most contamination is unintentional, and happens when safe development standards and practices are absent in the software organizations of the food chain.
Current Prevention Methods
We have grouped the general methods of managing IP contamination into two groups: i) corrective methods, which try to detect contamination or IP policy infractions in a piece of software, and ii) preventive methods, which strive to stop the unintentional penetration of undesirable code into a project.
We will further divide these methods into manual and automated techniques, and will comment on their suitability and perceived shortcomings.
Corrective solutions by their very nature require there to be an asset to be analyzed, and therefore are commonly employed as a prefix to the load-build process or as a suffix to the product release process.
The general objectives behind corrective solutions are: (i) to detect possible IP contamination; (ii) to identify the external source from which the IP contamination was derived; (iii) to determine the validity of suspected IP contamination; and (iv) to appropriately respond to the possible IP contamination. IP contamination can take one of two forms: (i) a complete module or file, or (ii) a snippet such as a subroutine or method within a module or file. Reflecting the fact that IP contamination is seldom malicious, it is not uncommon for a contaminating object to carry comments that clearly identify its source or copyright owner.
Corrective solutions can be quite sophisticated, and there can be value in pursuing just one or more of the above objectives. For example, it is not always necessary to identify the source from which an IP contamination object was derived. Lexical and grammatical analysis of a file can be employed to detect and flag unexplained changes in coding style. Also, detection and identification may be one and the same. A precautionary sweep of all the files used in a load build may be performed to ensure there are no inadvertent and unaccounted for dependencies on commercial or open source licensed software modules.
Corrective solutions typically involve a combination of both manual and automated processes. A comprehensive and formal review may be performed by an internal team of code reviewers and legal counsel as part of the general development process. This may be an extension to the typical code review that is commonly employed during the development process to find bugs and improve the software. Alternatively, an external commercial due diligence service may be employed to provide an independent assessment in order, for example, to validate that the software asset being sold rightfully belongs to the seller. In either case, the process involves examining various artifacts such as the software bill-of-materials, software files, and documentation files, determining the sources of various artifacts, and interviewing developers. In short, the process looks for indicators of external artifacts and, upon finding suspects, investigates their IP attributes.
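A precautionary sweep of this kind can begin with a plain text scan for license, copyright and authorship markers. The following Python sketch is only an illustration: the marker patterns and the scan_text function are assumptions made for this example, not part of any particular tool, and a real policy would define its own list.

```python
import re

# Hypothetical marker patterns, chosen for illustration only.
MARKERS = {
    "license": re.compile(r"\blicen[sc]e\b", re.IGNORECASE),
    "copyright": re.compile(r"\bcopyright\b", re.IGNORECASE),
    "author": re.compile(r"\bwritten by\b|\bauthor\b", re.IGNORECASE),
}

def scan_text(text):
    """Return (line_number, marker_name, line) for every marker found in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                hits.append((number, name, line.strip()))
    return hits
```

Running scan_text over the contents of each file in a load build yields a list of lines for a human reviewer to inspect. A hit is only a prompt for review; it does not by itself establish contamination, and the absence of hits does not establish cleanliness.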
The more common commercially available automated tools typically address the identification of external software modules/files together with external source code within modules/files. To accomplish this function, these tools have associated repositories containing software modules/files, actual source code, and the equivalent of code fingerprints known as codeprints. These tools accomplish their objectives by comparing a given artifact against the contents in the repository. Such repositories are commonly assembled by mining the vast collection of generally available OSS as well as contributed commercial software source code, object files, library files, and executables. The contribution of such commercial software creates a win-win for everyone involved as:
- The owners of the commercial software are increasingly confident that their code is being properly used
- The developers of software are increasingly confident that their market offer is not exposed to the risks associated with inadvertently incorporating commercial software
- The tools developer has a more comprehensive and useful market offer
A few examples of the automated mechanisms used to detect and identify external code snippets include:
- Comparison and correlation of the snippet with various snippets of code existing in a repository
- Computing a codeprint for the snippet and comparing this codeprint against a repository of pre-computed codeprints
- Scanning the snippet for keywords such as "license", identification marks such as "written by", and copyright notices
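The codeprint comparison mentioned above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of any commercial tool: here a codeprint is assumed to be the set of SHA-1 digests of every k-character window of the whitespace-stripped source, and similarity is measured as the Jaccard overlap of two such sets. The function names and the window size k=12 are hypothetical choices.

```python
import hashlib
import re

def normalize(source):
    """Drop whitespace and case so that reformatting alone does not change the codeprint."""
    return re.sub(r"\s+", "", source).lower()

def codeprint(source, k=12):
    """Return the set of SHA-1 digests of every k-character window of the normalized source."""
    text = normalize(source)
    if not text:
        return set()
    # Sources shorter than k yield a single window covering the whole text.
    windows = [text[i:i + k] for i in range(max(1, len(text) - k + 1))]
    return {hashlib.sha1(w.encode("utf-8")).hexdigest() for w in windows}

def similarity(a, b):
    """Jaccard similarity of two codeprints: shared windows over all windows."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def best_match(snippet, index, k=12):
    """Compare a snippet against an index of pre-computed codeprints.

    index maps an entry name to its codeprint. Returns (name, score) for the
    closest entry, or (None, 0.0) when nothing overlaps.
    """
    target = codeprint(snippet, k)
    best_name, best_score = None, 0.0
    for name, stored in index.items():
        score = similarity(target, stored)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

The index would be built once from the repository, for example as a dictionary mapping each file name to codeprint(contents), and then queried for each snippet under review. Because normalization here only removes whitespace and case, renamed identifiers would defeat this sketch; practical tools would need a more robust normalization step.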
As mentioned, a combination of manual and automated methods is generally employed. Two noteworthy academic projects in this field are: (i) MOSS (Measure Of Software Similarity) at Stanford University, and (ii) JPLAG at Karlsruhe University.
Limitations of Corrective Solutions
Although the steps for corrective solutions are relatively straightforward, in practice they are very labor intensive, time consuming and expensive. Increasing automation makes this endeavor more feasible, which is contributing to the increasing use of such tools in software development. This, in turn, is driving more innovation and a wider range of market offers in this area.
However, there are limitations to corrective solutions. Even with automation and aggregation, corrective solutions cannot detect external content unless they can identify it. In other words, if external content is not available in the tool's repository and is not otherwise detectable by being clearly marked or stylistically different, it cannot be detected. In particular, it is extremely difficult, if not impossible, to detect proprietary external content since it is generally not available for comparison purposes.
Even if the external content does exist in the database, the comparison process is not always 100% accurate. Moreover, the comparison process is computationally intensive, requiring a tight linkage between the comparison algorithms and the repository. This typically requires co-locating the repository and the code to be examined.
Corrective solutions address the relatively well defined issue of spotting IP contamination. There are, however, a myriad of related and overlapping issues. Notable among these issues is software pedigree, or "who wrote this stuff?". Software pedigree is concerned with determining whether or not the copyright license attached to a file/module or code snippet is properly attached and that the license really does apply. Another issue surrounds determining what constitutes a software derivative, as some OSS licenses require all derivatives of a given work to be licensed under the same license as the original.
Possibly the most significant limitation associated with corrective solutions is that they are corrective, occurring after the fact. Resolution of any corrections can impact the project's completion date, transaction closing and sales cycle, and add unanticipated costs to the overall project.
As the saying goes, "An ounce of prevention is worth a pound of cure". As with corrective solutions, preventive solutions may be categorized in terms of manual and automated processes. Among the more widespread manual preventive solutions is education. Education includes organizations setting policies and rules for acceptable and safe coding practices, as well as the associated communications and training. These include the introduction of guidelines that are well documented, generally available for developer reference, and are integrated into the company's practices.
The success of education in addressing IP contamination is dependent upon a number of factors, including the education program employed and the ethical behavior of the programmers. While recognized as clearly important, education will generally be insufficient in and of itself to prevent IP contamination, both malicious and non-malicious.
Another preventive solution is setting the requirement that only specific code may be used. Again, rules must be defined, documented and communicated on the acceptable practices and sources of code. A common criticism of this solution is that it may not generally apply. For example, programmers may find that it limits their choices and needs, and that acceptable alternatives become hindered by approval processes. Pressures of deadlines and deliverables create an atmosphere of tension between following the process and adhering to the rules. It is noteworthy that we have recently seen the emergence of successful commercial ventures that offer a database of IP-indemnified, pedigreed OSS.
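A rule that only specific code may be used can be written down as a checkable list. The Python sketch below assumes a hypothetical policy expressed as a set of approved license identifiers and a hypothetical inventory that maps each component to its declared license; neither the policy contents nor the function name comes from any real organization or tool.

```python
# Hypothetical approved-license policy, for illustration only.
APPROVED_LICENSES = frozenset({"MIT", "BSD-3-Clause", "Apache-2.0"})

def review_components(components, approved=APPROVED_LICENSES):
    """Return {component: reason} for every component that fails the policy.

    components maps a component name to its declared license identifier,
    or to None when no license has been recorded.
    """
    rejected = {}
    for name, license_id in sorted(components.items()):
        if license_id is None:
            rejected[name] = "no declared license"
        elif license_id not in approved:
            rejected[name] = f"license {license_id} is not on the approved list"
    return rejected
```

A check like this only verifies the declared license against the written policy. It does not verify that the declaration is accurate, which remains a pedigree question.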
Automated preventive solutions rely upon the detection and identification of external content immediately upon it being introduced into a project. Integration of the preventive solution within the development environment enables detection of external content, although it may not necessarily automatically identify the source of that content. Detection can flag the introduction and optionally require the developer importing the code to annotate the source for future reference. Timely detection of violations of the company's IP policy, and the possibility of immediate correction, are among the advantages of automated preventive solutions. Like corrective solutions, however, preventive solutions do not address the related issues associated with determining software derivatives.
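The flag-and-annotate behavior described above can be sketched as a check that runs whenever files are added to a project. The Python example below is an assumption-laden illustration: it treats any file path not already tracked by the project as newly introduced content and requires that its origin be recorded. The Finding class, the check_new_files function and the shape of the provenance record are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    path: str
    reason: str

def check_new_files(tracked_paths, incoming_paths, provenance):
    """Flag every incoming file that is new to the project and has no recorded origin.

    tracked_paths: set of paths already in the project.
    incoming_paths: iterable of paths in the proposed change.
    provenance: dict mapping a path to a developer-supplied origin note,
                such as "in-house" or the name of an external source.
    """
    findings = []
    for path in sorted(incoming_paths):
        if path in tracked_paths:
            continue  # existing file; this sketch only gates new introductions
        origin = provenance.get(path, "").strip()
        if not origin:
            findings.append(Finding(path, "new file without a recorded origin"))
    return findings
```

A development environment or commit hook could refuse the change while the returned list is non-empty. This sketch detects only wholesale file additions; detecting external snippets pasted into existing files would need a finer-grained mechanism.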
The software development industry is witnessing a transition, brought about by the explosive growth of OSS, code-search engines and outsourcing practices. The new order brought about by this transition carries certain IP challenges that must be effectively handled, otherwise the results could be catastrophic for a software company. IP policies must be set, monitored and enforced.