The EDRN Catalog and Archive Service (eCAS) is a networked, metadata-enabled system for the capture, tracking, validation, and retrieval of cancer biomarker research data and related artifacts. eCAS promises to be an invaluable tool that will make it possible for doctors, scientists, clinicians, and researchers to share their results, correlate their data, discover promising knowledge of new biomarkers, and more, thereby improving the state of public health by combating cancer. eCAS will be developed by NASA’s Jet Propulsion Laboratory (Pasadena, California) in concert with the National Cancer Institute’s Early Detection Research Network (EDRN). It will be available to EDRN members for immediate use and deployment within and between institutions.

Managing Data

A researcher tabulating results from a spectrograph has access to hundreds of data points. Finding relative minima and maxima creates new data points. Combining that data with calibration information for the instrument refines that data. Correlating those results with other spectrograph runs creates information for tracking changes over time, between specimens sampled, between instruments manufactured, and more. Publishing such data into a paper results in yet more information being synthesized. Another researcher drawing conclusions from the paper and comparing it to other papers results in yet more opportunities for synthesis of data, information, and knowledge.

However, there is also ample opportunity within these processes for data to be lost. A real-time instrument run may disappear if not properly connected to a recording device. The data may be meaningless without calibration data, without units to describe it, or without other ancillary information. Without previous runs for comparison, trust in its accuracy falls. Published into a paper, data often appears in textual tables or in statistical graphs; yet these are not forms in which others can easily reuse results in further research and collaborative studies. And when data is made available electronically, it often goes unaccompanied by the ancillary information, such as metadata descriptions or descriptive documents, which makes its interpretation and reuse possible.

This is the data management problem that eCAS will solve. eCAS uses three specific strategies:

  1. eCAS tags every datum, whether a spreadsheet, microscopy image, spectrograph, published paper, data table, or other document, with an open-ended set of extensible metadata. Metadata enables discovery and description of information, making information easier to find and useful once found.
  2. eCAS tracks every datum such that the system knows its location, version, and distribution status at every moment. Using encryption, authorization, and authentication, researchers can release a specific version of a specific document for private storage, for access by a work group, for access within an institution, or for general consumption. New data presented to the system can undergo an open-ended set of post-processing steps that validate it, generate derived data, or perform other tasks such as notifying subscribers of interesting datasets.
  3. eCAS employs open web standards, including TLS, RDF, XML, and HTTP. eCAS can be connected directly to science tools and instruments to ingest data as it is produced. Through translation layers, eCAS can work with departmental data sources without impacting existing operations. eCAS can protect sensitive information, such as patient records, from prying eyes, while encouraging distribution of experimental datasets and related documentation at controllable levels of exposure. And eCAS captures extensible metadata that enables evolving abilities to describe, discover, and interpret data.
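
As a rough illustration of the first two strategies, the sketch below models a tracked datum with open-ended metadata and a release level. The names `Exposure` and `TrackedDatum`, the field names, and the four release levels are assumptions made for this example; they are not taken from the eCAS software.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Exposure(IntEnum):
    """Hypothetical release levels, ordered from narrowest to widest audience."""
    PRIVATE = 0
    WORKGROUP = 1
    INSTITUTION = 2
    PUBLIC = 3


@dataclass
class TrackedDatum:
    """One version of one data object (illustrative structure only)."""
    identifier: str
    version: int
    location: str
    exposure: Exposure = Exposure.PRIVATE
    metadata: dict = field(default_factory=dict)  # open-ended key/value tags

    def visible_to(self, audience: Exposure) -> bool:
        # A datum is visible to an audience if it was released to that
        # audience or to a wider one.
        return self.exposure >= audience


run = TrackedDatum(
    identifier="spectrograph-run-17",  # made-up identifier
    version=2,
    location="institute-x",
    exposure=Exposure.WORKGROUP,
    metadata={"instrument": "spectrograph"},
)
print(run.visible_to(Exposure.WORKGROUP))  # released to the workgroup
print(run.visible_to(Exposure.PUBLIC))     # not released for general consumption
```

A real deployment would also need the encryption and authentication described above; this sketch covers only the bookkeeping.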

eCAS Architecture

In the envisioned deployment of eCAS, institutions run eCAS software servers that participate in a peering network grid, creating a virtual organization that transcends institutional boundaries:

eCAS Nation Diagram

This data grid enables scalable and transparent replication of data and metadata, improving availability and reliability. This architecture is mirrored within institutions, creating virtual departments; and within departments, creating virtual workgroups:

Department Diagram

Flexible access controls enable researchers to designate that a particular datum be available to specific users, workgroups, departments, organizations, and so forth.

Because eCAS supports standards and standard interfaces, we envision it being connected directly to science tools. eCAS can be the first step in scientific observation and data gathering, ensuring that results are immediately cataloged, archived, and tagged with metadata for refinement and discovery:

Raw Data

Metadata Descriptions

Metadata is the key to enabling use and reuse of data. Without metadata, there is little chance of discovering data, interpreting it correctly, and reusing it in an automated fashion.

Every data object ingested into eCAS comes with an optionally supplied and/or automatically generated metadata description. Metadata descriptions describe precisely the morphology, semantic context, inception, and other aspects of data. Metadata schemata in turn describe the vocabulary used to create metadata descriptions. eCAS draws from a number of metadata schemata:

  • The Dublin Core Metadata Initiative defines 15 metadata elements that have far-reaching applications. The elements include classifiers such as title, description, subject keywords, languages used, creators, contributors, and so forth. Although originally inspired by library science, these elements have application across numerous disciplines. For example, a ‘creator’ of a resource can be the author of a paper, and the ‘description’ could be the abstract.
  • The EDRN Common Data Elements (CDEs) define over 300 metadata elements that are ideally suited to defining cancer specimens and biomarkers. CDEs include classifiers for specimen kind, patient race, smoking history, specimen storage, cell count, temperature, volume, and more.
  • The myGrid Consortium defines a schema for life sciences and bioinformatics. This schema includes classifiers for molecular processes, genes, animals, and more.

This list is not exhaustive; in fact, the metadata system is open ended, enabling multiple concurrent metadata vocabularies to simultaneously describe a data object.

The Resource Description Framework (RDF) is the metadata technology that eCAS uses to enable description, discovery, interpretation, and use of data. RDF is a semantic web technology developed by the World Wide Web Consortium (W3C). RDF uses a series of statements that describe data objects. Statements include a precise metadata schema so that an expression such as “The ‘title’ is Advanced Nucleic Sequencing Techniques” can be interpreted correctly by automated systems. For example, ‘title’ could be defined as the Dublin Core metadata element, so that ambiguity between the title of an article and a person’s title of address is avoided.

Statements themselves become data objects in RDF that can be further described or reified. This enables an implicit trust system to emerge. For example, the statement “The third column contains temperature measurements in degrees centigrade” can be reified with the statement “That statement was made by Doctor Philip Jones.”
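
The sketch below shows how statements and reification fit together, using plain Python tuples in place of an RDF library. The two namespace URIs are the published Dublin Core and RDF namespaces. The `urn:example:` identifiers, the `Triple` type, and the `reify` helper are assumptions made for illustration, and a literal string stands in for the person.

```python
from typing import NamedTuple

DC = "http://purl.org/dc/elements/1.1/"              # Dublin Core element namespace
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # RDF vocabulary namespace


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


def reify(statement_id: str, t: Triple) -> list:
    """Describe a statement as a resource using the RDF reification vocabulary."""
    return [
        Triple(statement_id, RDF + "type", RDF + "Statement"),
        Triple(statement_id, RDF + "subject", t.subject),
        Triple(statement_id, RDF + "predicate", t.predicate),
        Triple(statement_id, RDF + "object", t.obj),
    ]


# Hypothetical identifiers for a data object and for the statement about it.
paper = "urn:example:paper-1"
statement = "urn:example:statement-1"

# "The 'title' is Advanced Nucleic Sequencing Techniques", with 'title'
# pinned to the Dublin Core element so its meaning is unambiguous.
title = Triple(paper, DC + "title", "Advanced Nucleic Sequencing Techniques")

graph = [title] + reify(statement, title)
# A statement about the statement: who made it.
graph.append(Triple(statement, DC + "creator", "Doctor Philip Jones"))

for t in graph:
    print(t.subject, t.predicate, t.obj)
```

Because the predicate is a full URI and not the bare word "title", an automated consumer can tell which vocabulary defines it.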

The relation of data objects to metadata is depicted below:

Multiple metadata schemata provide machine usable explanations of a metadata description, which serves to describe the inception and composition of data. Data can come in a variety of flavors, including tabular datasets, videos, images, documents, and other formats.

Ingestion, Post Processing, and Storage

eCAS ingests new data through a variety of mechanisms. As previously stated, we foresee eCAS being connected directly to scientific instruments. In addition, by supporting standard network transports and interfaces, eCAS can ingest data from spreadsheets, comma-separated value files, and other media. eCAS supports upload of data over HTTP or through lightweight client application programs.

After ingestion of a data product, eCAS assigns a globally unique identifier and begins generation of metadata. Metadata includes any provided during the ingestion process as well as any generated by examining the ingested data. eCAS indexes the metadata for immediate search and retrieval. Depending on site-specific policies, such data may be immediately available for sharing by a workgroup, department, institution, or entire public network.
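
The ingestion steps above can be sketched as follows. This is a minimal illustration under stated assumptions: a random UUID serves as the globally unique identifier, the generated metadata is limited to a size and a checksum, and the index is an in-memory inverted index. The names `ingest`, `catalog`, and `index` are invented for the example.

```python
import hashlib
import uuid
from collections import defaultdict

catalog = {}              # identifier -> metadata description
index = defaultdict(set)  # lower-cased metadata term -> matching identifiers


def ingest(payload: bytes, supplied: dict) -> str:
    """Assign an identifier, build the metadata description, and index it."""
    identifier = str(uuid.uuid4())  # assumption: a random UUID as the unique ID
    generated = {
        "size_bytes": str(len(payload)),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    # Supplied metadata takes precedence over generated values on a key clash.
    metadata = {**generated, **supplied}
    catalog[identifier] = metadata
    for value in metadata.values():
        for term in str(value).lower().split():
            index[term].add(identifier)
    return identifier


object_id = ingest(b"a,b\n1,2\n", {"title": "Spectrograph run"})
print(object_id in index["spectrograph"])  # searchable right after ingestion
```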

Post processing occurs after ingestion. Post processing consists of a series of configurable, programmable agents that are each triggered to run upon the ingestion of new kinds of data objects. Such agents can serve a variety of roles. For example, raw data from an instrument is important to save but may be less useful to a researcher. A post-processing agent can automatically generate additional data objects from the raw data, saving these processed objects as new, derived products for search/retrieval through eCAS. Other agents might generate additional, richer metadata, facilitating future data discovery. Here is an example of the processing pipeline, from raw data ingest, to agent running, to local eCAS storage:

eCAS Pipeline Processing
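
One way to picture the agent mechanism is a registry keyed by kind of data object, as in the sketch below. The decorator, the kind names, and the peak-finding agent are assumptions chosen for illustration; the actual agent interface is not described here.

```python
from typing import Callable

# Registry: kind of data object -> agents to run when such an object is ingested.
agents: dict = {}


def agent(kind: str) -> Callable:
    """Register a function as a post-processing agent for one kind of object."""
    def register(fn: Callable) -> Callable:
        agents.setdefault(kind, []).append(fn)
        return fn
    return register


@agent("raw-spectrum")  # hypothetical kind name
def find_peaks(obj: dict) -> list:
    """Derive a new data object listing the indices of relative maxima."""
    values = obj["values"]
    peaks = [
        i for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]
    return [{"kind": "peak-list", "derived_from": obj["id"], "values": peaks}]


def post_process(obj: dict) -> list:
    """Run every agent registered for this object's kind; collect derived objects."""
    derived = []
    for fn in agents.get(obj["kind"], []):
        derived.extend(fn(obj))
    return derived


raw = {"id": "run-1", "kind": "raw-spectrum", "values": [0, 3, 1, 4, 2]}
for product in post_process(raw):
    print(product)
```

The raw object stays untouched, and each derived product records which object it came from so it can be stored and searched as a new product.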

eCAS stores data and metadata using the file systems on which each eCAS instance is running. However, retrieval of data automatically mirrors such objects throughout the network. Such mirroring enables better availability (if a primary data source goes down) as well as higher efficiency (by pipelining data from closer and/or multiple data sources).

Search and Retrieval

Sharing data objects by word of mouth or through subscription notifications is one way researchers can find interesting products and enhance collaboration. However, eCAS also supports searching and browsing of data.


Searching for data uses the eCAS web interface or client application. Using the familiar vector search provided by Google, a researcher can simply type a number of terms into a search box and find matching data objects, ordered by relevance. This provides a familiar search environment that can yield results quickly.

An example of this web-based search interface to eCAS is shown below:

eCAS Prototype with Google-like Interface

Operating within the familiar web browser, researchers can simply enter text search terms that eCAS matches against metadata. eCAS ‘crawls’ its network of servers, locating data objects to which the user has access.
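
To make relevance ordering concrete, the sketch below ranks metadata descriptions against a query using cosine similarity over term counts, a textbook vector-space technique. It is a stand-in for illustration only and is not the search engine eCAS uses; the function names and sample descriptions are invented.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Turn text into a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, descriptions: dict) -> list:
    """Return identifiers of matching objects, most relevant first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), oid) for oid, text in descriptions.items()]
    return [oid for score, oid in sorted(scored, reverse=True) if score > 0]


# Made-up metadata descriptions keyed by object identifier.
descriptions = {
    "a": "blood specimen dna",
    "b": "lung tissue image",
    "c": "blood plasma",
}
print(search("blood dna", descriptions))
```

Access control is omitted here; a real search would first discard objects the user is not permitted to see.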

Facet Based Browsing

While the free-text, vector-based search will likely be the first place users turn to for quickly locating eCAS data, a more controlled search is required for precise location of scientific datasets. eCAS directly provides such a feature by presenting ‘facets’ of available metadata, enabling drill-down to matching datasets.

Facet-based search uses a dynamic taxonomy generated from available metadata. As an example, a drill-down through metadata facets for tissue specimens can start by choosing patients with cancer versus those without. After selecting those with cancer, a researcher has a narrower view of results and can constrain the search by tissue site. After selecting, say, blood, the researcher might then go on to select the sampled blood specimen, DNA, then the specimen storage, liquid nitrogen, and at that point browse a list of matches.

The figure below gives an example of this drill-down. The numbers on each arrow show the number of matching data objects; green arrows show the path the researcher takes:

Facet Examples
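
The drill-down can be sketched as repeated filtering, with facet counts recomputed from whatever records remain. The facet names and sample records below are invented for illustration and do not come from the EDRN Common Data Elements.

```python
from collections import Counter


def facet_counts(records: list, facet: str) -> Counter:
    """Count how many remaining records fall under each value of a facet."""
    return Counter(r[facet] for r in records if facet in r)


def drill_down(records: list, **selected) -> list:
    """Keep only the records matching every selected facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in selected.items())]


# Made-up specimen metadata.
records = [
    {"diagnosis": "cancer", "site": "blood", "specimen": "DNA", "storage": "liquid nitrogen"},
    {"diagnosis": "cancer", "site": "blood", "specimen": "serum", "storage": "freezer"},
    {"diagnosis": "cancer", "site": "lung", "specimen": "tissue", "storage": "liquid nitrogen"},
    {"diagnosis": "no cancer", "site": "blood", "specimen": "DNA", "storage": "freezer"},
]

with_cancer = drill_down(records, diagnosis="cancer")
print(facet_counts(with_cancer, "site"))  # choices offered at the next level

matches = drill_down(
    records, diagnosis="cancer", site="blood", specimen="DNA", storage="liquid nitrogen"
)
print(matches)
```

Because the counts come from the records that remain after each selection, the taxonomy is dynamic: a facet value with no matches never appears as a choice.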


Once a researcher has identified a data object to retrieve, eCAS automatically arranges the most efficient delivery it can. Recall that eCAS assigns a unique identifier to every stored object. Using this identifier, eCAS can determine whether local or nearby eCAS servers have the data object, without necessarily resorting to the curating eCAS server. eCAS can also multiplex retrieval of data from multiple servers at once, alleviating bottlenecks at single servers and increasing network efficiency. This is eCAS’s grid-based data movement feature.

Moreover, once an eCAS client has retrieved a complete data object, it too can then serve that data product to other eCAS instances. In this way, popular data is automatically mirrored throughout the network. This helps ensure timely retrieval of data, better availability of data, and higher perceived reliability of eCAS.

As an example, consider a large dataset consisting of a number of spectrograph results. A researcher at Institute Y retrieves it from the single, original eCAS source where it was ingested, Institute X. Another researcher at Institute Z who needs a copy can get even-numbered rows from Institute X and odd-numbered rows from Institute Y, preventing network saturation at either institute. As more and more copies are retrieved, the upload burden on each site is reduced, and the total bandwidth of the network is increased. eCAS handles this automatically.
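
The even and odd split in this example generalizes to any number of sources by striding over row numbers. The sketch below assumes each source holds a complete copy and uses in-memory lists in place of network transfers; `fetch_rows` and `multiplexed_fetch` are hypothetical names.

```python
def fetch_rows(source: list, row_numbers) -> dict:
    """Stand-in for a network request: return the requested rows from one source."""
    return {n: source[n] for n in row_numbers}


def multiplexed_fetch(sources: list, row_count: int) -> list:
    """Assemble a dataset by taking every k-th row from each of k sources."""
    rows = {}
    for offset, source in enumerate(sources):
        # Source 0 supplies rows 0, k, 2k, ...; source 1 supplies 1, k+1, ...
        rows.update(fetch_rows(source, range(offset, row_count, len(sources))))
    return [rows[n] for n in range(row_count)]


dataset = ["row 0", "row 1", "row 2", "row 3", "row 4"]
institute_x = list(dataset)  # original copy
institute_y = list(dataset)  # copy retrieved earlier

# Institute Z pulls even-numbered rows from X and odd-numbered rows from Y.
copy_at_z = multiplexed_fetch([institute_x, institute_y], len(dataset))
print(copy_at_z == dataset)
```

Once Institute Z holds a full copy, it could be added to the list of sources for the next request, so each site would serve roughly a third of the rows.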


The EDRN Catalog and Archive Service (eCAS) solves multiple problems of data management. It prevents loss of data, encourages collaboration, facilitates discovery, enables novel correlation between datasets, and makes data reuse a tangible possibility. Using flexible access controls, metadata, grid features, and pipeline processing, eCAS promises to be a worthy tool for scientific research and the eradication of cancer.