Patrick Gray (patrick.c.gray at duke) - https://github.com/patrickcgray

Chapter 8: Use xarray to handle N-dimensional arrays and interactive plotting with geoviews

In many cases in the earth sciences you will be working with remote sensing data that has dense time series and is highly dimensional (e.g. lat x long x time x measurements). Often times working with this data as a typical numpy array my be clunky and error prone and keeping the data in the labeled format with metadata could help resolve many of the errors a data scientists might make. Additionally many of the datasets are so large they can not effectively fit in memory or should be processed in parallel. Here we'll introduce xarray which effectively resolves nearly all of these challenges. From the documentation:

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”) are an essential part of computational science. They are encountered in a wide range of fields, including physics, astronomy, geoscience, bioinformatics, engineering, finance, and deep learning. In Python, NumPy provides the fundamental data structure and API for working with raw ND arrays. However, real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.

Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.

Xarray is inspired by and borrows heavily from pandas, the popular data analysis package focused on labeled tabular data. It is particularly tailored to working with netCDF files, which were the source of xarray’s data model, and integrates tightly with dask for parallel computing.

drawing


One of the other exciting aspects of xarray is the ability to use cloud-based data directly off the web and only actually download and process it as needed.

In this chapter we'll be using xarray to process ocean color data from the the Ocean Colour Climate Change Initiative project (https://www.oceancolour.org/) from the northern Atlantic Ocean. We'll also be leveraging holoviews and geoviews in the high level plotting API hvplot. Holoviews is an impressively powerful python library for interactive . Geoviews "is a Python library that makes it easy to explore and visualize geographical, meteorological, and oceanographic datasets, such as those used in weather, climate, and remote sensing research." And hvplot is a high level API that provides a consistent and simple interface for using holoviews and geoviews when plotting any of its supported data types (e.g. Pandas DataFrames, XArray: Datasets, Dask DataFrames, GeoPandas: GeoDataFrames, and much more.

Simple xarray DataArrays

Working with our Ocean Color Dataset

We'll just have xarray pull the data in directly from the cloud using OPeNDAP. We're getting this data from the Ocean Colour Climate Change Initiative project (https://www.oceancolour.org/) which compiles a global merged chlorophyll-a product.

This dataset can be easily opened from an OPeNDAP link using the xarray open_dataset() function though it may take a minute or two:

This dataset is a bit larger than we want so we'll slice the dataset down to an area of interest using the .sel() function and the slice() function which allow the user to simply put in the direct lat and lon bounds they want.

Now that we have the data let's explore it a bit

the main data variable is chlor_a which we access as an attribute of the ds object:

let's just plot it normally selecting the first element in the time dimension

Note how it usefully auto labels the axes in the plot. We can use normal matplotlib syntax to modify this plot to be a bit more visually appealing.

How about if we plot it averaging along the longitude and time dimensions and to show average chla amount across latitudes? We'll also just select a time subset to make this go faster.

If you want to know what times we just averaged you can see them here:

This gives a little glimpse at how easy it is to handle complex datasets with xarray

But a lot of the real power of xarray comes in when we have dense time series so let's look at all of our data.

Let's look at the chl-a across time and longitude, averaged for the latitude range of 34-35. This is effectively off Cape Hatteras and out into the ocean.

Now this is really cool! We can see intense seasonality out in the open ocean and along the coastal areas with the most productivity in the winter both for the ocean and coast.

Let's take the average chla in two different regions, one north and one south of the Gulf Stream, and look at them across time. First let's see where the two regions are spatially:

Now let's select them and plot them across time:

As expected the region north of the Gulf Stream with more access to nutrients is has much more chl-a and it has a more intense spring bloom.

As a final step before going into interactive visualizations check out how easy it is to resample the dataset temporally. Let's make it a monthly dataset instead of 8 day time steps:

caution this resample operation may take a fair amount of memory and is liable to crash if you're running on a Binder instance with low memory

Now let's bring in hvplot and see what it can do

Let's make a similar plot to the one above where we selected the first time step but now let's make it interactive and with the ability to step through each time point. We'll also download the coastline and add that to the plot.

Play around with the buttons on the left that allow you to zoom, save as an image, and move around the plot.

Now let's looks at chla averaged for a subset of our time series and plotted spatially. Again play with the zooming and panning tools.

Hopefully you're begining to see how powerful this interactivity can be for explorng a dataset.

Pretty amazing already, and while it is relatively clear in this oceanographic dataset, it would be useful to have a better idea of where we are in the world!

We'll pull in cartopy and geoviews.feature to give projections and global features respectively. Explore options at https://scitools.org.uk/cartopy/docs/v0.15/crs/projections.html and http://geoviews.org/user_guide/Geometries.html.

Final Wrapup

Now we're actually getting into some informative plotting and edging into cartography and many of these tools incredibly useful for analyzing remote sensing data. And that ends our series, congrats on making it through the whole thing and please add an issue on github if you have any problems or suggestions for additions!