OpenNAPIS™ comprises a database design standard and open source software components that support the natural products drug discovery, dietary supplements and bioenergy research communities. It is a software technology platform that users deploy to enter and manage their own proprietary research data. The user community includes representatives from industry, government and academia.
The database design standard will be finalized in Q3-2010 and software components will become available starting Q4-2011. Based in part on the commercially available NAPIS® software technology, the OpenNAPIS project is supported by grant funds from the National Institute of General Medical Sciences (NIH/NIGMS).
The OpenNAPIS Project seeks to establish an open source community of scientific users and software developers to facilitate the effective use of relational databases. We invite your participation. Let us know what you think by joining the discussion at the end of this page.
This project builds on more than a decade of experience supporting the natural products (NP) research community with the NAPIS technology. NAPIS is a proprietary technology and the only commercially available database software for this market. It is widely adopted within the pharmaceutical industry and is the current database standard for the government-sponsored International Cooperative Biodiversity Groups (ICBG) program. The alternatives to NAPIS are in-house, purpose-built systems, which tend to be process-oriented, rigid, and brittle; one example is the NEXUS database developed by Merck. Of particular note are the numerous ad-hoc databases that researchers create themselves, typically as Excel spreadsheets, which lack the standardization necessary to integrate with other databases. OpenNAPIS will bring the benefit of lessons learned throughout the NP research process in support of programs for academia, government and pharmaceutical industry organizations.
Developed with Small Business Innovation Research (SBIR) grant support from the National Cancer Institute (NIH/NCI), the award-winning and patented NAPIS technology has been deployed for 21 different NP research organizations since its release in 1997. In support of these deployments, proprietary data from more than 45 different purpose-built source databases have been migrated into the NAPIS data model.
The initial NAPIS development effort involved more than 30 participating scientists from across the U.S. in the requirements specification and beta test processes. This community-based approach was perpetuated with the delivery of NAPIS Lite, a shareware version of the program for use by field biologists that has been available for free download since 1997 with more than 7,500 downloads to more than 70 countries. In addition to NP research, NAPIS Lite has been a very popular database for investigators performing biodiversity inventories.
The next step in the evolution of the NAPIS technology is the move to an open source architecture.
OpenNAPIS will be made available under a dual-license model.  This model allows for open source distribution, the contribution of additional code from outside developers, the curation of an official codebase and source code repository, and the ability to create and deploy the proprietary applications that are necessary for the operation of a viable business.
The dual-licensing model distinguishes between the rights that different recipients (developers and users) receive. Recipients of the software choose the terms under which they want to use it. They may choose an “open source license” that requires any work derived under it to be released under the same license. The alternative is a “proprietary software license” that allows for the development of proprietary applications. Examples of products released under a dual-license model include the MySQL database and the Mozilla Firefox web browser.
Importantly, the OpenNAPIS software components will be available to the research community as open source, and organizations that require a proprietary customized system deployed and validated specifically for them will have that option.
The open source design of this project will necessarily attract participants that we cannot identify. Considering this, we plan to use both a direct and an indirect approach to reach them. An important part of this approach includes a formal survey of our prospective users to identify their interests and requirements.
The direct approach centers on establishing a Steering Committee that will perform a “requirements analysis” based in part on the results of the user survey. It will determine the focus of our early efforts in developing software components. The outcome of this effort is a formal “requirements specification” document that defines the end-user’s perspective, which is then used to manage the transition to a developer’s perspective. A second outcome will be the final OpenNAPIS Database Design Standard, intended for publication in a relevant scientific journal. The Steering Committee convened its first meeting September 16, 2010 in Washington DC.
The indirect approach uses this wiki and an outreach strategy for announcing the OpenNAPIS project to societies in the NP drug discovery, dietary supplements and bioenergy research communities. Announcements for the project, with requests to be published in their newsletters and links from their websites, will be sent to the American Society of Pharmacognosy, the Society for Industrial Microbiology and the American Society for Microbiology. Interested participants from the research community will be invited to attend the Steering Committee meeting virtually via videoconference.
Members of the Steering Committee include:
|Gregg Dietzman||OpenNAPIS Product Manager||White Point Systems, Inc.||Program Director|
|Dwight Baker, PhD||NP Drug Discovery, Microbiology, Screening||White Point Systems, Inc.||Senior Scientist|
|Barbara Timmermann, PhD||Dietary Supplements, Analytical Chemistry||University of Kansas||Chair, Medicinal Chemistry|
|Paul Lewer, PhD||Analytical Chemistry, LIMS||Dow AgroSciences||Advanced Technology Development|
|Giselle Tamayo, PhD||Bioprospecting||Instituto Nacional de Biodiversidad (INBio)||Technical and Scientific Coordinator|
|Frank Koehn, PhD||NP Drug Discovery||Pfizer||Research Fellow- Natural Products|
|Toby Karyadi||Software Development - Lead||White Point Systems, Inc.||Senior Software Architect|
|John Sullivan||Software Development||The Broad Institute||Senior Software Engineer|
Participants attending the August meeting include:
|Dave Newman, PhD||NP Drug Discovery||National Cancer Institute (NIH/NCI)||Chief, Natural Products Branch|
Regardless of the strategy for creating or gaining access to chemical diversity from nature, the approach for working with NP has many consistencies. For drug discovery, small molecules are isolated from crude mixtures that are extracted from the biomass samples of a producing organism. Investigation into dietary supplements and the synergistic activity of chemical compounds in mixtures is a direct extension. Discovery of enzymes that are important for bioenergy fits this scheme as well. This represents a baseline for NP research, past, present and future.
This baseline therefore represents a stable starting position for establishing a database design standard.
The overarching objective is clear. By establishing common database standards and developing ontologies that allow databases to communicate with one another, it is possible to achieve a level of interoperability that is essential for the future. The NCI’s cancer Biomedical Informatics Grid (caBIG®) is a contemporary example of a comprehensive data management standard whose goal is the translation of molecular medicine to personalized care. The standard provides for the integration of genomics and proteomics, biospecimen management, and clinical trials with a series of mechanisms for interoperability between widely distributed data centers. Compliance with the caBIG standard is based on compatibility across components including database modeling, controlled vocabularies, and common data elements.
We propose to emulate the caBIG strategy by establishing a database design standard for the NP research community. By contrast, the needs of this community exist on a much smaller scale when compared with the caBIG landscape. We propose only a flexible and scalable database design. Controlled vocabularies and common data elements are beyond the scope of immediate needs of the NP research community at the present time, but will be discussed by the Steering Committee and may be included at a later stage.
A functional design represents the “user’s perspective” in software development, and storyboards are a common way to communicate it. These tools are then used to migrate a project over to a “developer’s perspective” and a physical design for implementing a relational database. For background information on software design and relational databases there are a number of useful reviews.
The OpenNAPIS Functional Database represents the functional design elements as entities in a simple relational database. It includes the minimum set of database tables necessary to comply with the OpenNAPIS Data Standard. It is a working database design. If you are a researcher just getting started with implementing an ad-hoc database you will find this to be a very useful starting point. The table definitions provided are flexible and scalable so you can add or remove fields from the tables on an as-needed basis. Important fields that are required to comply with the standard are indicated in the table definitions. You may implement this design in an MS Excel workbook, an MS Access database, or in a client-server database like MySQL; or start in Excel and migrate up to a more powerful database backend. As your research project expands you can modify your database to match by adding tables and following the design of the OpenNAPIS Data Standard (below) which provides a stable growth path.
Download the OpenNAPIS Functional Database design with table definitions and data examples here:
The OpenNAPIS Data Standard is most formally a physical design, also referred to as an “entity relationship diagram” (ERD), and its graphic representation is shown to the right. The boxes in the ERD represent the database tables, and the crow’s foot notation drawn between them represents the relationships (e.g. one-to-many). The Central Line through the model is identified by the shaded grey boxes, which show the minimum set of database tables and relationships required to comply with the OpenNAPIS standard, approximating the functional design above. When reading this ERD, remember that high-volume entities are toward the upper left corner and that, generally, crows fly south and east. Start reading at the “site” entity (for the geographic site where an organism is collected) in the lower right corner and work your way up through the ERD to the “activity” entity (for bioassay or other activity results) in the upper left.
Other tables shown in this data model may be added or removed on an “as needed” basis, providing a scalable and flexible architecture for NP databases of any size or scope. For example, the light-yellow shaded boxes show tables that are specific to microorganism-based research for handling the source material, engineered microorganism genealogy, cryopreserved strains, and media. Additional design layers (not shown) that add laboratory information management system (LIMS), protocol management, security and auditing functionality may be added as required. For publication in a relevant scientific journal the OpenNAPIS Data Standard will include an extensive discussion on the design rationale and options.
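As a rough illustration of the central line described above, running from “site” to “activity”, the following Python sketch builds a small chain of one-to-many tables in an in-memory SQLite database. It is not the OpenNAPIS table definitions: apart from “site” and “activity”, which the text names, every table name, column name and data value here is a hypothetical placeholder chosen to mirror the organism, biomass sample and crude extract steps of the NP workflow.

```python
import sqlite3

# Hypothetical table and column names. The actual OpenNAPIS table
# definitions are not reproduced here; this only mirrors the
# site -> organism -> sample -> extract -> activity chain.
SCHEMA = """
CREATE TABLE site (
    site_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    latitude  REAL,
    longitude REAL
);
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    site_id     INTEGER NOT NULL REFERENCES site(site_id),
    taxon_name  TEXT NOT NULL
);
CREATE TABLE sample (
    sample_id   INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    description TEXT
);
CREATE TABLE crude_extract (
    extract_id INTEGER PRIMARY KEY,
    sample_id  INTEGER NOT NULL REFERENCES sample(sample_id),
    solvent    TEXT
);
CREATE TABLE activity (
    activity_id INTEGER PRIMARY KEY,
    extract_id  INTEGER NOT NULL REFERENCES crude_extract(extract_id),
    assay_name  TEXT NOT NULL,
    result      REAL
);
"""


def build_demo_database():
    """Create the sketch schema and load placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the one-to-many links
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO site VALUES (1, 'Example site', 10.0, -84.0)")
    conn.execute("INSERT INTO organism VALUES (1, 1, 'Example taxon')")
    conn.execute("INSERT INTO sample VALUES (1, 1, 'leaf biomass')")
    conn.execute("INSERT INTO crude_extract VALUES (1, 1, 'methanol')")
    conn.executemany(
        "INSERT INTO activity VALUES (?, ?, ?, ?)",
        [(1, 1, "assay A", 0.42), (2, 1, "assay B", 0.07)],
    )
    return conn


def activities_for_site(conn, site_id):
    """Walk the chain from a collection site up to its activity results."""
    return conn.execute(
        """
        SELECT s.name, o.taxon_name, a.assay_name, a.result
        FROM site s
        JOIN organism o      ON o.site_id = s.site_id
        JOIN sample sa       ON sa.organism_id = o.organism_id
        JOIN crude_extract e ON e.sample_id = sa.sample_id
        JOIN activity a      ON a.extract_id = e.extract_id
        WHERE s.site_id = ?
        ORDER BY a.activity_id
        """,
        (site_id,),
    ).fetchall()


if __name__ == "__main__":
    for row in activities_for_site(build_demo_database(), 1):
        print(row)
```

Because each table carries only a foreign key to its parent, optional tables can be added beside this chain without changing the joins along it, which is one way to read the “as needed” scalability described above.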
While this project will develop the software components for private and secure database systems, a requirement for NP drug discovery, the resulting technologies will allow researchers (when ready) to effectively contribute to outside and public data management initiatives. Examples include open source research and development efforts like PubChem and ChemBase, which are based on Web 2.0 concepts.
Special consideration will be given to configuration of the technologies that makes them culturally adaptive for international research partners. Globalization of the data collection and the user interface may be required in some cases to respond to local cultural norms and language. The user interface might require that the forms be in the local language to support the researchers, or the underlying database system itself may be required to support different languages, for example, to record the medicinal uses of plants in the language of the people that traditionally use them.
The relational database management system (RDBMS) technologies that we plan to support and interoperate with include the open source MySQL and PostgreSQL, and Oracle (Oracle has free versions for academics and is imperative for the pharmaceutical industry).
Of note is that RDBMS technologies have made significant advances to meet the need for globalization within the last three years, and can now be configured to support the multilingual use of Unicode character sets (e.g. Chinese ideographs) in addition to the traditional Latin/Western character sets.
NP source microorganisms and plant and marine macroorganisms can all be managed in the OpenNAPIS database design with special consideration for the complicated relationships that may exist between them. Taxonomy name assignments of source organisms will reference available on-line databases that are typically deployed using a federated database design. Based on our initial investigation of options, we plan to work first with the Species2000 database. We plan in-depth discussions on this topic during the requirements analysis and may consider alternatives. Incorporation of ethnomedical data will be discussed.
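To make the multilingual requirement concrete, here is a minimal Python sketch that stores a description in Chinese alongside a language code and reads it back unchanged. It uses SQLite only because it ships with Python; the table name, column names and sample text are hypothetical, and a production MySQL, PostgreSQL or Oracle deployment would need its own Unicode character set configured.

```python
import sqlite3


def store_and_fetch_use_record(description, language_code):
    """Round-trip one multilingual text record through a database.

    The table and column names are hypothetical placeholders.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE ethnomedical_use (
            use_id        INTEGER PRIMARY KEY,
            language_code TEXT NOT NULL,
            description   TEXT NOT NULL
        )
        """
    )
    # Parameter binding passes the Python str through as Unicode text.
    conn.execute(
        "INSERT INTO ethnomedical_use (language_code, description) VALUES (?, ?)",
        (language_code, description),
    )
    row = conn.execute(
        "SELECT language_code, description FROM ethnomedical_use"
    ).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    # Sample text meaning roughly "used to reduce fever".
    print(store_and_fetch_use_record("用于退烧", "zh"))
```

Keeping the language code in its own column is one simple way to let the same table hold records in several languages side by side.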
Mapping of source organisms is accomplished using spatially enabled data (i.e. with latitude and longitude) from the RDBMS. A mashup with Google Earth will be implemented for display of these data using the OpenNAPIS Web Service API.
In addition to developing the software components for this area, we are also prototyping a field biology and biodiversity inventory tool, analogous to NAPIS Lite, as a smartphone “app” for the Apple iPhone and Android. This app will utilize the smartphone features for capture of geographic position and photographs and integrate them with a local RDBMS, on-line taxonomy checklists to automate data entry, a mashup with Google Earth, and capability for upload to parent OpenNAPIS systems.
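One common way to hand latitude and longitude records to Google Earth is a KML file. The Python sketch below converts a list of site records into KML placemarks using only the standard library. It is an assumption about how such a mashup could work, not the OpenNAPIS Web Service API: the function name and the dictionary keys are hypothetical, and the records are supplied in memory rather than fetched from a service.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def collection_sites_to_kml(sites):
    """Return a KML document string with one placemark per site.

    `sites` is a list of dicts with hypothetical keys
    'name', 'latitude' and 'longitude'.
    """
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for site in sites:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = site["name"]
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML expects longitude first, then latitude, then altitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = (
            f"{site['longitude']},{site['latitude']},0"
        )
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    demo_sites = [{"name": "Example site", "latitude": 10.5, "longitude": -84.0}]
    print(collection_sites_to_kml(demo_sites))
```

Saving the returned string to a `.kml` file gives something Google Earth can open directly. Note the coordinate order, which is the reverse of the usual “latitude, longitude” convention.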
Genetic fingerprint sequences and the related attributes of their determination will be stored in the RDBMS using either the character large object (CLOB) datatype, or the variable-length character (VARCHAR2) datatype, and linked to the source organism. Generally stored for reference, they may be retrieved for analysis using Web service applications linked to public databases like GenBank or the Ribosomal Database Project. Dynamic queries will allow researchers to determine phylogenetic similarity of their source organisms with those characterized in curated public databases. Mashups of phylogenetic similarity data with chemical structure and/or spectroscopy data will increase the efficiency of structural identification of bioactive compounds.
There are evolving standards for spectroscopy data management using chemical markup XML and CMLSpect, and for display using the open source JSpecView, that we will investigate. Spectroscopy data (e.g. UV and mass spectroscopy) acquired for chromatography experiments will be stored in the RDBMS. Data exported from the control software in chemical markup XML format will be imported into the RDBMS. Data retrieved from the RDBMS for data-mining and display using interoperable open source software (e.g. JSpecView) will use the same XML formatting as appropriate.
For chemoinformatics, we plan to use the open source Chemistry Development Kit (CDK) and JChem Cartridge from ChemAxon, which has a liberal license structure for academics. Chemical structure storage and query will be done in the RDBMS through use of the JChem Cartridge data cartridge technology from ChemAxon, which specifies a special “structure” datatype. The conversion between these data sources will be accomplished using data adaptors that will be designed and implemented as the need arises. Other chemoinformatics functionality will be investigated using the CDK. The integration will be done using a loosely coupled design strategy, for example, using a ‘plugin’ strategy to accommodate other types of chemical structure cartridges. Special consideration will be given to the implementation design for the optional use of proprietary technologies in this class.
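The import step described above might look something like the following Python sketch, which reads a peak list from an XML fragment and loads it into a relational table. The fragment is illustrative only: its element and attribute names are assumptions modeled loosely on CML-style peak lists and have not been validated against the CMLSpect schema, and the table and column names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Illustrative fragment only. Element and attribute names are assumptions,
# not a validated CMLSpect document, and the values are placeholders.
EXAMPLE_XML = """
<spectrum id="uv-001" type="UV/VIS">
  <peakList>
    <peak id="p1" xValue="254.0" yValue="0.81"/>
    <peak id="p2" xValue="280.0" yValue="0.35"/>
  </peakList>
</spectrum>
"""


def import_spectrum(conn, xml_text):
    """Parse one spectrum fragment and insert its peaks; return the peak count."""
    root = ET.fromstring(xml_text.strip())
    spectrum_id = root.get("id")
    peaks = [
        (spectrum_id, float(peak.get("xValue")), float(peak.get("yValue")))
        for peak in root.iter("peak")
    ]
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS spectrum_peak (
            spectrum_id TEXT NOT NULL,
            x_value     REAL NOT NULL,
            y_value     REAL NOT NULL
        )
        """
    )
    conn.executemany("INSERT INTO spectrum_peak VALUES (?, ?, ?)", peaks)
    return len(peaks)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(import_spectrum(connection, EXAMPLE_XML), "peaks imported")
```

Storing peaks as rows rather than as an opaque XML document makes them directly queryable for data-mining, while the original XML could still be regenerated for display tools that expect it.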
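The ‘plugin’ strategy mentioned above can be sketched in Python as a small interface plus a registry, so that application code asks for a cartridge by name and never imports a specific vendor's module. Everything here is hypothetical: the class names, method names and registry are illustrative, and the in-memory backend compares structure strings literally as a stand-in. It performs no real chemistry and does not represent the JChem Cartridge or CDK APIs.

```python
from abc import ABC, abstractmethod


class StructureCartridge(ABC):
    """Hypothetical interface every structure backend would implement."""

    @abstractmethod
    def add_structure(self, compound_id, structure):
        """Store a structure string under a compound identifier."""

    @abstractmethod
    def find_exact(self, structure):
        """Return the sorted compound identifiers matching the structure."""


_REGISTRY = {}


def register_cartridge(name):
    """Class decorator that makes a backend available under `name`."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


def create_cartridge(name):
    """Instantiate a registered backend, or fail with a clear error."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"No structure cartridge registered as {name!r}") from None


@register_cartridge("in_memory")
class InMemoryCartridge(StructureCartridge):
    """Placeholder backend: literal string comparison, no real chemistry."""

    def __init__(self):
        self._structures = {}

    def add_structure(self, compound_id, structure):
        self._structures[compound_id] = structure

    def find_exact(self, structure):
        return sorted(
            cid for cid, stored in self._structures.items() if stored == structure
        )


if __name__ == "__main__":
    cartridge = create_cartridge("in_memory")
    cartridge.add_structure("C-1", "CCO")
    print(cartridge.find_exact("CCO"))
```

Under this arrangement, an open source backend and an optional proprietary one would each register under their own name, and the choice between them becomes a configuration value rather than a code change.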