Patrick Gray (patrick.c.gray at duke) - https://github.com/patrickcgray
In many cases in the earth sciences you will be working with remote sensing data that has dense time series and is highly dimensional (e.g. lat x lon x time x measurements). Working with this data as a plain numpy array is often clunky and error prone, and keeping the data in a labeled format with its metadata attached can prevent many of the mistakes a data scientist might make. Additionally, many of these datasets are so large that they cannot effectively fit in memory or should be processed in parallel. Here we'll introduce xarray, which effectively resolves nearly all of these challenges. From the documentation:
Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”) are an essential part of computational science. They are encountered in a wide range of fields, including physics, astronomy, geoscience, bioinformatics, engineering, finance, and deep learning. In Python, NumPy provides the fundamental data structure and API for working with raw ND arrays. However, real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.
Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.
Xarray is inspired by and borrows heavily from pandas, the popular data analysis package focused on labeled tabular data. It is particularly tailored to working with netCDF files, which were the source of xarray’s data model, and integrates tightly with dask for parallel computing.
One of the other exciting aspects of xarray is the ability to use cloud-based data directly off the web, downloading and processing it only as needed.
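To make the lazy, load-on-demand model concrete before we touch real cloud data, here is a minimal sketch using a synthetic in-memory array chunked with dask (the shapes and values are made up; a remote OPeNDAP or cloud dataset behaves analogously, with bytes only transferred when values are actually needed):

```python
import numpy as np
import xarray as xr

# Build a labeled 3-D array (time x lat x lon); chunking it with dask
# makes downstream operations lazy, so reductions build a task graph
# instead of computing immediately.
da = xr.DataArray(
    np.arange(10 * 4 * 5, dtype=float).reshape(10, 4, 5),
    dims=["time", "lat", "lon"],
).chunk({"time": 5})

lazy_mean = da.mean(dim="time")  # builds a dask task graph; nothing computed yet
result = lazy_mean.compute()     # the computation actually runs here
print(result.shape)              # (4, 5)
```

With a remote dataset the same pattern applies: opening and subsetting are cheap, and only the final `.compute()` (or a plot) triggers data transfer.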
In this chapter we'll be using xarray to process ocean color data from the Ocean Colour Climate Change Initiative project (https://www.oceancolour.org/) over the northern Atlantic Ocean. We'll also be leveraging holoviews and geoviews via the high level plotting API hvplot. Holoviews is an impressively powerful python library for interactive visualization. Geoviews "is a Python library that makes it easy to explore and visualize geographical, meteorological, and oceanographic datasets, such as those used in weather, climate, and remote sensing research." And hvplot is a high level API that provides a consistent and simple interface for using holoviews and geoviews when plotting any of its supported data types (e.g. pandas DataFrames, xarray Datasets, dask DataFrames, GeoPandas GeoDataFrames, and much more).
%matplotlib inline
import netCDF4
import matplotlib.pyplot as plt
# numpy
import numpy as np
import dask
# xarray (very handy)
import xarray as xr
import rasterio
# http://geo.holoviews.org/index.html
import holoviews as hv
import geoviews as gv
import geoviews.feature as gf
import warnings
import sys
gv.extension('bokeh')
da = xr.DataArray([9, 0, 2, 1, 0])
da
<xarray.DataArray (dim_0: 5)> array([9, 0, 2, 1, 0]) Dimensions without coordinates: dim_0
da = xr.DataArray([9, 0, 2, 1, 0], dims=['x'])
da
<xarray.DataArray (x: 5)> array([9, 0, 2, 1, 0]) Dimensions without coordinates: x
da = xr.DataArray([9, 0, 2, 1, 0],
dims=['x'],
coords={'x': [10, 20, 30, 40, 50]})
da
<xarray.DataArray (x: 5)> array([9, 0, 2, 1, 0]) Coordinates: * x (x) int64 10 20 30 40 50
da.plot(marker='o')
[<matplotlib.lines.Line2D at 0x7fd601891f10>]
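The payoff of attaching those coordinate labels is label-based indexing: you select data by meaningful values instead of integer positions. A quick sketch using the same small DataArray we just built:

```python
import xarray as xr

da = xr.DataArray([9, 0, 2, 1, 0],
                  dims=['x'],
                  coords={'x': [10, 20, 30, 40, 50]})

# Select by coordinate label rather than by integer position
print(da.sel(x=30).item())             # 2
# Positional indexing is still available via isel
print(da.isel(x=2).item())             # 2
# Label-based slices include both endpoints
print(da.sel(x=slice(20, 40)).values)  # [0 2 1]
```

This is exactly the machinery we'll rely on below when slicing the ocean color dataset by latitude and longitude.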
We'll just have xarray pull the data in directly from the cloud using OPeNDAP. We're getting this data from the Ocean Colour Climate Change Initiative project (https://www.oceancolour.org/), which compiles a global merged chlorophyll-a product.
This dataset can be easily opened from an OPeNDAP link using the xarray open_dataset() function, though it may take a minute or two:
chla_ds = xr.open_dataset('https://www.oceancolour.org/thredds/dodsC/CCI_ALL-v5.0-5DAY?lat[1000:1:1550],lon[2350:1:2750],time[1450:1:1650],chlor_a[1450:1:1650][1000:1:1550][2350:1:2750]')
chla_ds
<xarray.Dataset> Dimensions: (lat: 551, lon: 401, time: 201) Coordinates: * lat (lat) float64 48.31 48.27 48.23 48.19 ... 25.52 25.48 25.44 25.4 * lon (lon) float64 -82.06 -82.02 -81.98 -81.94 ... -65.48 -65.44 -65.4 * time (time) datetime64[ns] 2017-06-25 2017-06-30 ... 2020-03-21 Data variables: chlor_a (time, lat, lon) float32 ... Attributes: (12/50) _NCProperties: version=1|netcdflibversion=4.4.1.1|hdf... Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 cdm_data_type: Grid comment: See summary attribute creator_email: help@esa-oceancolour-cci.org ... ... time_coverage_resolution: P5D time_coverage_duration: P5D start_date: 26-DEC-2020 00:00:00.000000 stop_date: 30-DEC-2020 23:59:00.000000 time_coverage_start: 202012260000Z time_coverage_end: 202012302359Z
This dataset is a bit larger than we want, so we'll subset it to an area of interest using the .sel() function with slice(), which lets the user simply specify the lat and lon bounds they want (note the lat slice runs from 44 down to 26 because the lat coordinate is in descending order):
chla_ds = chla_ds.sel(lat=slice(44,26),lon=slice(-82,-66))
chla_ds
<xarray.Dataset> Dimensions: (lat: 432, lon: 384, time: 201) Coordinates: * lat (lat) float64 43.98 43.94 43.9 43.85 ... 26.15 26.1 26.06 26.02 * lon (lon) float64 -81.98 -81.94 -81.9 -81.85 ... -66.1 -66.06 -66.02 * time (time) datetime64[ns] 2017-06-25 2017-06-30 ... 2020-03-21 Data variables: chlor_a (time, lat, lon) float32 ... Attributes: (12/50) _NCProperties: version=1|netcdflibversion=4.4.1.1|hdf... Conventions: CF-1.7 Metadata_Conventions: Unidata Dataset Discovery v1.0 cdm_data_type: Grid comment: See summary attribute creator_email: help@esa-oceancolour-cci.org ... ... time_coverage_resolution: P5D time_coverage_duration: P5D start_date: 26-DEC-2020 00:00:00.000000 stop_date: 30-DEC-2020 23:59:00.000000 time_coverage_start: 202012260000Z time_coverage_end: 202012302359Z
The main data variable is chlor_a, which we access as an attribute of the dataset object:
chla_ds.chlor_a.isel(time=1).plot()
<matplotlib.collections.QuadMesh at 0x7fd608afa820>
Note how it usefully auto-labels the axes in the plot. We can use normal matplotlib syntax to modify this plot to be a bit more visually appealing.
from matplotlib.colors import LogNorm
fig,ax = plt.subplots(figsize=(12,8))
chla_ds = chla_ds.sel(lat=slice(44,26),lon=slice(-82,-66))
chla_ds.chlor_a.isel(time=1).plot(ax=ax, cmap='jet', norm=LogNorm(vmin=0.01, vmax=10))
#ax.set_ylim(ax.get_ylim()[::-1])
plt.show()
How about if we average along the longitude and time dimensions to show the mean chl-a across latitudes? We'll also select a time subset to make this go faster.
chla_ds.chlor_a.isel(time=slice(30,50)).mean(dim=('lon', 'time')).plot()
[<matplotlib.lines.Line2D at 0x7fd6016b0880>]
If you want to know what times we just averaged you can see them here:
chla_ds.chlor_a.isel(time=slice(30,50)).time
<xarray.DataArray 'time' (time: 20)> array(['2017-11-22T00:00:00.000000000', '2017-11-27T00:00:00.000000000', '2017-12-02T00:00:00.000000000', '2017-12-07T00:00:00.000000000', '2017-12-12T00:00:00.000000000', '2017-12-17T00:00:00.000000000', '2017-12-22T00:00:00.000000000', '2017-12-27T00:00:00.000000000', '2018-01-01T00:00:00.000000000', '2018-01-06T00:00:00.000000000', '2018-01-11T00:00:00.000000000', '2018-01-16T00:00:00.000000000', '2018-01-21T00:00:00.000000000', '2018-01-26T00:00:00.000000000', '2018-01-31T00:00:00.000000000', '2018-02-05T00:00:00.000000000', '2018-02-10T00:00:00.000000000', '2018-02-15T00:00:00.000000000', '2018-02-20T00:00:00.000000000', '2018-02-25T00:00:00.000000000'], dtype='datetime64[ns]') Coordinates: * time (time) datetime64[ns] 2017-11-22 2017-11-27 ... 2018-02-25 Attributes: axis: T standard_name: time _ChunkSizes: 1
This gives a little glimpse of how easy it is to handle complex datasets with xarray.
Let's look at the chl-a across time and longitude, averaged for the latitude range of 34-35. This is effectively off Cape Hatteras and out into the ocean.
chla_ds.chlor_a.sel(lat=slice(35,34.5)).mean(dim=('lat'), skipna=True).plot( cmap='jet', norm=LogNorm(vmin=0.01, vmax=10))
<matplotlib.collections.QuadMesh at 0x7fd60166dd90>
Now this is really cool! We can see intense seasonality both out in the open ocean and along the coast, with the most productivity in the winter in both regions.
Let's take the average chla in two different regions, one north and one south of the Gulf Stream, and look at them across time. First let's see where the two regions are spatially:
import matplotlib.patches as patches
fig,ax = plt.subplots(figsize=(12,8))
chla_ds = chla_ds.sel(lat=slice(44,26),lon=slice(-82,-66))
chla_ds.chlor_a.isel(time=1).plot(ax=ax, cmap='jet', norm=LogNorm(vmin=0.01, vmax=10))
rect = patches.Rectangle((-71, 33), 1, 1, linewidth=1, edgecolor='black', facecolor='b', alpha=0.5)
ax.add_patch(rect)
rect = patches.Rectangle((-71, 39), 1, 1, linewidth=1, edgecolor='black', facecolor='g', alpha=0.5)
ax.add_patch(rect)
#ax.set_ylim(ax.get_ylim()[::-1])
plt.show()
Now let's select them and plot them across time:
fig,ax = plt.subplots()
chla_ds.chlor_a.sel(lat=slice(33,32), lon=slice(-71,-70)).mean(dim=('lat', 'lon')).plot(ax=ax, color='blue')
chla_ds.chlor_a.sel(lat=slice(39,38), lon=slice(-71,-70)).mean(dim=('lat', 'lon')).plot(ax=ax, color='green')
plt.show()
As expected, the region north of the Gulf Stream, with more access to nutrients, has much more chl-a and a more intense spring bloom.
As a final step before going into interactive visualizations, check out how easy it is to resample the dataset temporally. Let's make it a monthly dataset instead of 5-day time steps:
chla_ds_monthly = chla_ds.chlor_a.resample(time="1M").mean()
fig,ax = plt.subplots()
chla_ds_monthly.sel(lat=slice(33,32), lon=slice(-71,-70)).mean(dim=('lat', 'lon')).plot(ax=ax, color='blue')
chla_ds_monthly.sel(lat=slice(39,38), lon=slice(-71,-70)).mean(dim=('lat', 'lon')).plot(ax=ax, color='green')
plt.show()
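A close cousin of resample is groupby, which aggregates by a calendar component rather than by contiguous windows, e.g. a monthly climatology that pools all Januaries together. A small sketch with a synthetic time series (the values here are made up, not ocean color data):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of made-up daily values
times = pd.date_range("2019-01-01", "2020-12-31", freq="D")
da = xr.DataArray(np.arange(len(times), dtype=float),
                  dims=["time"], coords={"time": times})

# Average all values sharing the same calendar month across both years
climatology = da.groupby("time.month").mean()
print(climatology.sizes["month"])  # 12
```

Applied to chlor_a, the same one-liner would give the average seasonal cycle across the full record, a handy complement to the monthly resample above.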
Let's pull in hvplot and see what it can do
import hvplot
import hvplot.pandas
import hvplot.xarray
import cartopy.crs as crs
Let's make a similar plot to the one above where we selected the first time step but now let's make it interactive and with the ability to step through each time point. We'll also download the coastline and add that to the plot.
Play around with the buttons on the left that allow you to zoom, save as an image, and move around the plot.
proj = crs.Orthographic(-90, 30)
chla_ds.chlor_a.hvplot.quadmesh(
'lon', 'lat', projection=proj, project=True,
cmap='jet', dynamic=True, coastline='10m',
frame_width=500, logz=True, clim=(0.01,20), rasterize=True)
Now let's look at chl-a averaged for a subset of our time series and plotted spatially. Again play with the zooming and panning tools.
chla_ds.chlor_a.isel(time=slice(0,40)).mean(dim='time').hvplot(x="lon", y="lat", width=500, height=300, cmap='jet', logz=True, clim=(0.01,20), rasterize=True)
Hopefully you're beginning to see how powerful this interactivity can be for exploring a dataset.
Pretty amazing already, and while it is relatively clear in this oceanographic dataset, it would be useful to have a better idea of where we are in the world!
We'll pull in cartopy and geoviews.feature to give us projections and global features respectively. Explore options at https://scitools.org.uk/cartopy/docs/v0.15/crs/projections.html and http://geoviews.org/user_guide/Geometries.html.
import geoviews.feature as gf
# the * operator allows you to add multiple datasets to the same plot
# you can also use the + operator to add multiple plots to the same overall figure
gf.ocean * \
gf.land * \
chla_ds.chlor_a.isel(time=slice(0,40)).mean(dim='time').hvplot(x="lon", y="lat", width=500, height=300, cmap='jet', logz=True, clim=(0.01,20), global_extent=True, projection=proj) *\
gf.coastline * \
gf.borders
Now we're actually getting into some informative plotting, edging into cartography, and many of these tools are incredibly useful for analyzing remote sensing data. And that ends our series, congrats on making it through the whole thing, and please add an issue on GitHub if you have any problems or suggestions for additions!