Data Synthesis CyberInfrastructure

Through collaboration with Microsoft Research, the Berkeley Water Center has developed a cyberinfrastructure for environmental data to address the data crisis affecting all of the environmental sciences.  The atmospheric science community must quantify climatic processes and anticipate how they will respond to natural and human-induced changes in atmospheric composition, land use, and dynamic variability.  Academic researchers worldwide have responded with Fluxnet programs, a worldwide network that rapidly measures the exchange of heat, water, carbon dioxide, and other trace species (the fluxes) between the atmosphere and the land surface.  Each measuring station generates extensive data internally, and the challenge has been to share the data among the research stations to achieve broader-scale synthesis.  The water resource sector is uniquely challenged because different monitoring agencies and regulatory bodies are each responsible for a small piece of the water resource system. For example, precipitation is measured and reported by a federal agency, state-wide organizations, and local entities, while surface waters are evaluated for water quality by another agency, and water allocations are most often regulated at the state agency level.  Water supply is regulated by a different organization than flood control, and both are often in conflict with agencies tasked with ecosystem protection and restoration.  Under these circumstances, the need for a Digital Watershed that serves as a central clearinghouse for data analysis is easily recognized.  The overriding goal for both Fluxnet and the Digital Watershed is to develop a system in which all interested parties have access to the same data and use that data for synthesis in understanding complex environmental systems, particularly where the human presence is felt.

Our approach to addressing these data challenges is through the development of data ingest protocols; software technology that efficiently stores, archives, and updates data from disparate sources; and interfaces to that data that empower environmental scientists to spend their time in analysis and synthesis, not data management.  Both Fluxnet and the Digital Watershed started with prototypes of modest size and complexity to demonstrate capability, followed by scaling to larger systems.  In Fluxnet the primary audience was academic and national laboratory scientists who were engaged in data collection efforts and needed tools for data management and collaboration. Their research questions and data needs dictated the development of the Fluxnet prototype for the Ameriflux community, and that system was scaled to serve the La Thuile synthesis workshop of the world-wide FLUXNET community.  In contrast, the approach taken in the Digital Watershed effort was to determine the needs of the water resource managers who had to balance often conflicting demands for scarce and uncertain water resources.  The prototype in this case was the relatively small Russian River watershed in northern California, where there is considerable uncertainty in how to restore migratory salmon while still providing reliable water for municipal and agricultural uses. Once the basic framework for the Digital Watershed was established for the Russian River, it was expanded to include all of California.  As the scope of the applications expands in disciplinary and geographical space, additional functionality is added to serve the needs of the user communities.

Technology

Participants

Fluxnet

Digital Watershed – Russian River Examples

Groundwater

Center for Information Technology Research in the Interest of Society (CITRIS) Project on California Groundwater Data

Advanced Scientific Capability for Environmental Management (ASCEM) at Department of Energy Sites

Watershed Evapotranspiration

Flux Tower in California / Migratory salmon going over a rubber dam on the Russian River -- Photo credit: Sonoma County Water Agency

 

Technology

We have built several key infrastructure components for environmental data management and synthesis.

  • Semi-automated data ingest tools that provide updates to the data on the servers.
  • Database and associated schema for storage of the data.
  • Data cubes providing pre-defined multi-dimensional views of data over various time intervals and spatial scales.
  • Data browsing and reports available via the web using Excel pivot tables and MATLAB as interfaces.
  • Compatibility with Geographic Information Systems for watershed delineation and interfacing with aerial and satellite imagery.
  • Support for multiple data versions.

Access to the database is provided through datacubes. A datacube organizes the data into dimensions and pre-calculates many aggregations along a dimension such as time, so that daily, monthly and yearly values are immediately available. Frequently desired calculations such as cumulative value, mean, maximum, and minimum are provided in the cube, with more extensive analysis done through an Excel or MATLAB interface. Typically several cubes are available at any one time on a server, and each usually provides access to a subset of the data. For instance, datacubes are frequently referenced to the period they cover: an update generates a new datacube, and the prior datacube is retained when it was used for particular calculations.  Access to and use of the datacubes and their interfaces are described in tutorials found in the Scientific Data Server User Manual: http://bwc.berkeley.edu/DataServerdefault.htm
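The pre-aggregation idea described above can be illustrated with a minimal Python sketch. It is not the Berkeley Water Center implementation: the function name build_cube, the record layout (site, variable, timestamp, value), and the sample values are all assumptions made for illustration, and calendar years are used for the yearly level for simplicity.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def build_cube(records):
    """Pre-aggregate (site, variable, timestamp, value) records by day, month, and year."""
    buckets = defaultdict(list)
    for site, variable, timestamp, value in records:
        # One bucket per time level; calendar years are an illustrative simplification
        periods = {
            "daily": timestamp.strftime("%Y-%m-%d"),
            "monthly": timestamp.strftime("%Y-%m"),
            "yearly": timestamp.strftime("%Y"),
        }
        for level, period in periods.items():
            buckets[(site, variable, level, period)].append(value)
    # Compute the summary statistics once so that later queries are plain lookups
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in buckets.items()
    }


# Hypothetical 15-minute flow readings, not actual measurements
records = [
    ("Guerneville", "flow", datetime(2006, 1, 1, 0, 0), 10.0),
    ("Guerneville", "flow", datetime(2006, 1, 1, 0, 15), 14.0),
    ("Guerneville", "flow", datetime(2006, 1, 2, 0, 0), 30.0),
]
cube = build_cube(records)
print(cube[("Guerneville", "flow", "daily", "2006-01-01")]["mean"])  # 12.0
print(cube[("Guerneville", "flow", "monthly", "2006-01")]["max"])    # 30.0
```

Because every aggregate is computed when the cube is built, a query for a daily mean or monthly maximum is a dictionary lookup rather than a scan of the raw measurements, which is the property that makes pivot table browsing responsive.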

Participants

The CyberInfrastructure Research Area of the Berkeley Water Center combines the expertise of campus faculty in Engineering and Natural Resources with that of earth and computational scientists at Lawrence Berkeley National Laboratory.  The work is being undertaken in close collaboration with researchers at Microsoft Research and with frequent interactions with the Sonoma County Water Agency, the National Marine Fisheries Service, and the US Bureau of Reclamation.

Fluxnet

                                           
The Fluxnet site contains tools that help scientists acquire, query, plot and manipulate diverse combinations of data from many sites, for many years and with various independent variables. These tools are the product of a collaboration among database specialists at national laboratories (Lawrence Berkeley National Laboratory, Oak Ridge National Laboratory, and the Max Planck Institute for Biogeochemistry, Jena), universities (Tuscia, Virginia, California-Berkeley) and industry (Microsoft). The Fluxnet site can be found at www.fluxdata.org/default.aspx

 

Digital Watershed – Russian River Examples

The power of the datacube is most easily demonstrated by the ease of access to data as partially illustrated here.  The intent here is to show what is in the California datacube using examples from the Russian River basin, and in addition show some data synthesis efforts that are straightforward applications of the datacube.  All the graphs are generated by plotting from Excel spreadsheets populated by queries using pivot tables interfaced with the datacube. The details on use of the datacube are described in the User Manual.

The first logical question relates to how much and what type of data is available. The US Geological Survey (USGS) records for the Russian River Basin are extensive, and that was one of the reasons for picking that watershed as a prototype.  The two plots below show the number of measurements within the datacube: the plot on the left gives counts by decade for four common parameters, and the plot on the right shows the distribution of data by water year for the 2000 decade.  In 1987 the USGS started reporting flow data every 15 minutes, which caused a dramatic increase in data counts.

Data Counts

 

Water quality data at 15-minute intervals first became available in the 2000 water decade, and the breakdown by water year in the plot shows that the data start in water year 2002. In general, since water temperature is easier to measure than dissolved oxygen and turbidity, there are slightly more measurements for water temperature. These pivot table plots could break each data type into the number of counts by month and water year to clearly denote where data are missing.  Such visualization tools are essential when trying to resolve data availability when there could be dozens of stations and many parameters at each station.  This becomes particularly useful when the simultaneous availability of different types of data is required, such as temperature and dissolved oxygen levels, as demonstrated below.
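The counting by water year described above can be sketched in a few lines of Python. This is an illustration only, assuming the USGS convention that a water year runs from 1 October through 30 September and is named for the calendar year in which it ends; the function names and the sample observations are hypothetical.

```python
from collections import Counter
from datetime import date


def water_year(d):
    """Return the water year of a date (1 October to 30 September, named for its ending year)."""
    return d.year + 1 if d.month >= 10 else d.year


def count_by_water_year(observations):
    """Count (date, parameter) observations per (water year, parameter) pair."""
    return Counter((water_year(d), parameter) for d, parameter in observations)


# Hypothetical observation dates, not actual station records
observations = [
    (date(2001, 10, 1), "water temperature"),
    (date(2002, 3, 15), "water temperature"),
    (date(2002, 3, 15), "dissolved oxygen"),
    (date(2002, 10, 2), "water temperature"),
]
counts = count_by_water_year(observations)
print(counts[(2002, "water temperature")])  # 2
print(counts[(2003, "dissolved oxygen")])   # 0, a gap in the record
```

A Counter returns zero for combinations that never occur, so missing water years or parameters show up directly as zero counts.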

Access to the actual data values is essential in modeling, visualization and comparisons.  The next figure below provides a typical comparison of daily stream flow in the Russian River at Guerneville for water years 2005 to 2010.  The winter period has rapid fluctuations in flows, and the summer and fall are dominated by groundwater seepage into the river and upstream reservoir releases. 

Comparing USGS flow and NOAA precipitation data obtained from their individual web sources is tedious given the different calendars and data formatting.  The following plot of daily flow and daily precipitation for the 2006 water year was obtained directly from the datacube. The pivot table selections specified that flow and precipitation data were desired, that the time period was the 2006 water year, that daily values were needed, and that the flow data came from the Russian River near Guerneville and the precipitation record from Healdsburg, located in the middle of the watershed. This plot demonstrates the flashiness of the Russian River, with its rapid response to precipitation events.
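Once both records share a common daily calendar, pairing them is straightforward. The following Python sketch shows one way to align two daily series by date; the function name align_daily and the sample values are assumptions for illustration and are not taken from the datacube.

```python
from datetime import date


def align_daily(flow_by_day, precip_by_day):
    """Return sorted (day, flow, precipitation) rows for days present in both series."""
    shared_days = sorted(flow_by_day.keys() & precip_by_day.keys())
    return [(day, flow_by_day[day], precip_by_day[day]) for day in shared_days]


# Hypothetical daily values, not actual USGS or NOAA data
flow = {date(2005, 12, 30): 150.0, date(2005, 12, 31): 900.0}
precip = {date(2005, 12, 31): 60.0, date(2006, 1, 1): 5.0}
rows = align_daily(flow, precip)
print(rows)  # only 31 December 2005 appears in both series
```

Keeping only the shared days avoids plotting one variable against gaps in the other.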
 

In the analysis of water quality, many of the variables are reported as concentrations, and mass loading rates are frequently required as the product of a concentration and a flow rate.  The next plot demonstrates how data visualization can rapidly assess water quality issues related to sediment transport.  The plot on the right has daily average turbidity data plotted against the average flow for that day, repeated for six water years.  The data have the expected power law dependence of turbidity on flow rate and show some slight interannual variability.  In some circumstances this relationship between flow and turbidity can be used to estimate sediment transport for years where flow rate data are available but the turbidity measurements are missing.
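Both calculations mentioned above can be sketched in Python. The first function fits a power law of the form turbidity = a * flow^b by ordinary least squares on log-transformed values, which is one common approach and is assumed here rather than taken from the source. The second converts a concentration and a flow rate into a daily mass loading. The function names, units, and sample values are hypothetical, and converting turbidity itself into a sediment concentration would require a site-specific calibration that is not shown.

```python
import math


def fit_power_law(flows, turbidities):
    """Fit turbidity = a * flow**b by least squares on log-transformed values."""
    xs = [math.log(q) for q in flows]
    ys = [math.log(t) for t in turbidities]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space is the exponent
    a = math.exp(y_mean - b * x_mean)   # intercept in log-log space gives the coefficient
    return a, b


def mass_loading_kg_per_day(concentration_mg_per_l, flow_m3_per_s):
    """Return the loading in kg/day for a concentration in mg/L and a flow in m^3/s."""
    # 1 mg/L equals 1 g/m^3; multiply by 86400 s/day, then convert g to kg
    return concentration_mg_per_l * flow_m3_per_s * 86400 / 1000


# Synthetic data generated from turbidity = 2 * flow**1.5, not actual measurements
flows = [1.0, 10.0, 100.0]
turbidities = [2.0 * q ** 1.5 for q in flows]
a, b = fit_power_law(flows, turbidities)
print(round(a, 6), round(b, 6))
print(mass_loading_kg_per_day(10.0, 2.0))
```

With a fitted a and b, a turbidity estimate for a day with flow q but no turbidity measurement would be a * q ** b, subject to the interannual variability noted above.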


The next plot considers the amount of dissolved oxygen within the river as a measure of aquatic health.  The daily dissolved oxygen data reported for the 2006 water year are plotted against the maximum water temperature for that day.  The dashed line represents the dissolved oxygen solubility in freshwater as reported in http://www.engineeringtoolbox.com/oxygen-solubility-water-d_841.html .  The Russian River does not have a significant depression in dissolved oxygen over the 2006 water year, particularly in the summer and early fall when water temperatures can be at their highest and flows at their lowest.

Groundwater

Groundwater quantity and quality are extremely important to the vitality of California, and represent a testbed for developing new tools for resource assessment, management and, often, remediation. With a population of over 30 million people, an agricultural economy based on intensive irrigation, and large urban industrial areas, there is a wide range of activities that have the potential to deplete, contaminate or otherwise jeopardize the groundwater in California. Groundwater typically moves slowly through pathways that are difficult to detect from the surface or from the limited information available in well drilling logs. We currently do not know how basin-scale groundwater quality will respond to contamination, overdraft, salt water intrusion, or irrigation with wastewater. Furthermore, we do not know how global change, which will influence the snowmelt runoff and thus groundwater recharge, will influence groundwater quality and sustainability.  There is considerable research underway within the Berkeley Water Center to assess the availability of groundwater data, improve data accessibility, provide tools for the visualization of measurements and modeling results, contribute to the operation of groundwater basins and assist in the remediation of contaminated aquifers.

Center for Information Technology Research in the Interest of Society (CITRIS) Project on California Groundwater Data

The Berkeley Water Center, Berkeley’s Geospatial Innovation Facility, and the U.C. Davis Information Center for the Environment are collaborating to make California’s groundwater data more easily accessible to the research community, as well as the general public.  Although California is frequently a leader in environmental arenas, we fall short in providing a comprehensive program to monitor and then regulate the use of groundwater. Though the regulation of groundwater has been considered on several occasions, the California Legislature has repeatedly held that groundwater management should remain a local responsibility, leading to a dispersed set of information maintained in different formats and managed to different standards.  The state government is facing growing pressure to regulate groundwater, with a forced recognition that surface and groundwaters are connected, their use is closely coupled, and groundwater basins could become the only way to obtain new storage capacity for managing future uncertainties in hydrologic variability.  This seed-funded effort anticipates changes in groundwater management that will be imposed throughout California in the coming years, starting with the data needs.

Our inter-campus team has broad experience in environmental databases, geographic information systems, and internet visualization. Associate Professor and Cooperative Extension Specialist Maggi Kelly is the Faculty Director of the Geospatial Innovation Facility at UC Berkeley (http://gif.berkeley.edu/), which supports geospatial research and centralizes access to specialized hardware and software.  The staff have experience in web mapping, high spatial resolution imagery analysis and geospatial outreach. Professor Jim Quinn is the director of the Information Center for the Environment at UC Davis (http://ice.ucdavis.edu/) and specializes in the development and dissemination of geospatial data and technologies and the creation of decision support systems geared toward improving the capabilities of environmental resource managers at a state-wide level.

Ongoing efforts in this research include the evaluation of existing tools for groundwater data discovery and accessibility to various users.  In the spirit of other Berkeley Water Center projects, a few selected geographical areas are used to assess the amount and character of the groundwater quantity and quality data, to better understand the needs of the local user communities as well as broader needs for statewide data management that is archival, compatible with other systems, accessible to a broad range of interested parties, and documented.  The research team is cognizant of issues related to the proprietary nature of well logs and security concerns in identifying well locations.  The goal is to use these testbeds to demonstrate the power of improved data management and to form the basis for scaling the analysis to larger geographical regions.

 

Advanced Scientific Capability for Environmental Management (ASCEM) at Department of Energy Sites

One legacy of the production of nuclear weapons in the United States and elsewhere is the presence of massively contaminated sites with long-term public health threats and expensive remediation costs.  Waste materials were either intentionally or unintentionally released to the subsurface environment and have contaminated soils and groundwater aquifers.  Vast sums have been spent assessing the risks at these sites and investigating remedial approaches, but with current technologies these sites are expected to remain contaminated and to require expensive, ongoing maintenance for decades.  Waste materials include many radioactive isotopes formed in the nuclear fuel production process, as well as other contaminants found at hazardous waste sites, such as toxic metals, organic solvents and acids.


Representation of the F-Area seepage basin at Savannah River Site following closure of the basin and installation of a treatment barrier.  
http://esd.lbl.gov/files/about/staff/susanhubbard/ASCEM_Phase_I_Demonstration_signed_1-11-11.pdf

Researchers at the Berkeley Water Center and the Computer Science Division at LBNL are applying datacube technology to the challenges of groundwater contaminant transport and remediation at these Department of Energy sites.  Groundwater data differ from surface water data in that the information is three dimensional, the data are infrequently collected, and information about the subsurface is generally uncertain given the limited information from well logs and geophysical surveys. Besides the data management efforts, this group will couple its efforts with those of other national laboratory research teams investigating enhanced visualization tools, high performance computing for simulating transport and remediation, and quantification of overall uncertainties in models based on measurements and assumed dominant transport processes.

Watershed Evapotranspiration

 

One of the greatest uncertainties in hydrologic science is the amount of water lost by evapotranspiration, largely from vegetation.  Through separate efforts in the Berkeley Water Center collaboration, estimates of evapotranspiration are being examined in two separate ways to better understand the mechanistic representation of plant transpiration.  This effort combines approaches from the Fluxnet community and the Digital Watershed effort.  With the availability of localized flux towers that measure water loss from the land surface into the atmosphere at high frequency, and of satellite imagery that can be related to plant coverage, moisture dynamics and temperature, it is possible to calculate evapotranspiration at the watershed scale for over a decade.  Similarly, the spatially distributed precipitation for a watershed, together with the runoff from that watershed measured by a gauge, gives an estimate of evapotranspiration, assuming little change in storage over the yearly time period.  This integration of satellite imagery, geospatial data, and monitoring data is essential in advancing our quantitative understanding of hydrologic science.
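The second approach is an annual water balance, in which evapotranspiration is estimated as precipitation minus runoff minus the change in storage, with the storage change assumed to be near zero over a water year. A minimal Python sketch follows; the function names, units, and numerical values are assumptions for illustration and do not describe the Russian River.

```python
def runoff_depth_mm(annual_runoff_m3, watershed_area_km2):
    """Convert an annual runoff volume into an equivalent depth over the watershed."""
    area_m2 = watershed_area_km2 * 1_000_000
    return annual_runoff_m3 / area_m2 * 1000  # metres of depth converted to millimetres


def evapotranspiration_mm(precip_mm, runoff_mm, storage_change_mm=0.0):
    """Annual water balance: ET = P - Q - dS, with dS assumed zero by default."""
    return precip_mm - runoff_mm - storage_change_mm


# Hypothetical watershed: 100 km^2 area, 1200 mm precipitation, 50 million m^3 runoff
runoff_mm = runoff_depth_mm(50_000_000, 100)
et_mm = evapotranspiration_mm(1200.0, runoff_mm)
print(runoff_mm, et_mm)  # 500.0 700.0
```

Expressing runoff as a depth over the watershed area puts it in the same units as the spatially averaged precipitation, so the two can be subtracted directly.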

 

The watershed outline of the Russian River drainage for the gauge at Ukiah, California, combined with the 2003 water year field determined by PRISM on a 4 km by 4 km grid (http://www.prism.oregonstate.edu/)