Patrick Gray (patrick.c.gray at duke) - https://github.com/patrickcgray

Chapter 1: Exploring rasterio

Introduction

GDAL - the Geospatial Data Abstraction Library is a software library for reading and writing raster and vector geospatial data formats and forms the basis of most software for processing geospatial data.

There are many formats for using GDAL ranging from graphical tools like ArcGIS or QGIS to command line GDAL tools but here we're using the fantastic rasterio python package which provides a pythonic wrapping around GDAL. Basically it reads and writes geospatial formats and provides a Python API based on numpy N-dimensional arrays and GeoJSON.

If you're coming from another language and want an overview of object oriented programming in Python, see the python like you meant it short online course.

Module import in Python

Before we can get started, we need to tell Python that we will be using functions, classes, and variables from some packages. The technical wording for this is that we need to import these modules into our namespace (see Python's documentation on the module system here).

We will do this using some import statements:

Once we import these packages Python will know where to look on our system for the code that implements them. When we want to access classes, variables, or functions within these packages, we will need to reference the full path (e.g. rasterio.open())

Examples

Open an image

When we open an image in rasterio we create a Dataset object. As the name would suggest, we can open an image with the "open" function within rasterio.

We will use an example image provided in the data directory for this chapter. This image is a subset of a Landsat 7 image containing the 8 bands on this sensor rearranged in order of wavelength (e.g., Landsat 7's second SWIR channel comes before thermal channel in our stack). The last band in this image is a cloud and cloud shadow mask from Fmask.

Now that we have this dataset open, let's explore some of its capabilities.

Image attributes

The first few pieces of information we obtained are fairly straightforward - image name, the raster size, the number of bands, a description, some metadata, and the raster's file format.

The image's projection is formatted in what's known as "Well Known Text". For more information on specific projections and for format conversions among projection description formats (e.g., proj4 string, WKT, ESRI WKT, JSON, etc.) see Spatial Reference.

The last piece of information we accessed is something called a "geotransform". This set of 6 numbers provides all the information required to and from transform pixel and projected coordinates. In this example, the first number (462405) and the fourth number (1741815) are the top left of the upper left pixel of the raster. The pixel size in x and y dimensions of the raster is listed as the second (30) and the sixth (-30) numbers. Since our raster is north up oriented, the third and fifth numbers are 0. For more information on the GDAL data model, visit this web page.

Image raster bands

now for the fun part, actually visualizing and working with the data

The rasterio Dataset object we created contains a lot of useful information but it is not directly used to read in the raster image. Instead we will need to access the raster's bands using the read() method:

When we load our raster band into memory we will read it into a NumPy 2 dimensional array. NumPy is, "the fundamental package for scientific computing with Python", because it allows us to represent our data in a very memory efficient way.

NumPy arrays are the cornerstone or building block of the rest of the Scientific Python suite of software. Get familiar with them:

Just as we made the routines and data types from rasterio available to us using import, we loaded up NumPy. When we import NumPy, we also gave it an alias so that we don't have to type numpy every time we want to use it.

The method read() takes arguments that allow us to specify a subset of the raster bands, specific X and Y offsets and sizes of the bands and much more. Remember this ability when you want to process large images or are working with a limited amount of memory. In these circumstances, you will run out of memory if you read the entire dataset in at once. Instead, read in a block of some number of columns and rows at one time, perform your computation and store your output, and then chunk through the rest of the image.

Read more here: https://rasterio.readthedocs.io/en/latest/api/rasterio.io.html#rasterio.io.BufferedDatasetWriter.read

For now, because this image is small, we'll just read in and display the entire image:

With our data read into a NumPy array, we can print it to console and even perform statistics on it. In addition to helping us store massive amounts of data efficiently, NumPy will help us with some basic linear algebra, numerical operations, and summary statistics.

For now let's plot that near infrared band we read in earlier.

The next chapter (link to webpage or Notebook) puts these lessons to use in order to calculate the Normalized Difference Vegetation Index (NDVI).