Detecting Buildings in Satellite Images
Building footprint detection with fastai on the challenging SpaceNet7 dataset.
- Introduction
- Setup
- Preprocessing
- Data Loading Functions
- Visualizing the Data
- Validation Strategy
- Undersampling
- Creating Dataloaders
- Defining the Model
- Training
- Visualizing the Results
- Discussion
Introduction
In this notebook I implement a neural-network-based solution for building footprint detection on the SpaceNet7 dataset. I ignore the temporal aspect of the original challenge and focus on segmenting buildings in single images. I use fastai, a deep learning library built on PyTorch that provides modern training best practices while reducing the amount of boilerplate code required.
The dataset is stored on AWS. Instructions for downloading it are here.
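For reference, the download can look something like the sketch below. This is only a sketch: the exact S3 path and tarball name are assumptions, so check the official SpaceNet instructions, and note that the AWS CLI (and possibly an AWS account) is required.
import os
import tarfile
# Sketch of the download step; the bucket path below is an assumption based on the
# public `spacenet-dataset` bucket. See the SpaceNet instructions for the exact location.
tarball = 'SN7_buildings_train.tar.gz'  # assumed file name
os.system(f'aws s3 cp s3://spacenet-dataset/spacenet/SN7_buildings/tarballs/{tarball} .')
with tarfile.open(tarball) as tar:
    tar.extractall('SN7_buildings')  # the extracted scene folders become `path` used below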
Defining training parameters:
BATCH_SIZE = 12 # 3 for xresnet50, 12 for xresnet34 with Tesla P100 (16GB)
TILES_PER_SCENE = 16
ARCHITECTURE = xresnet34
EPOCHS = 40
CLASS_WEIGHTS = [0.25,0.75]
LR_MAX = 3e-4
ENCODER_FACTOR = 10
CODES = ['Land','Building']
Exploring the dataset structure by displaying a few sample scene directories:
scenes = path.ls().sorted()
print(f'Number of scenes: {len(scenes)}')
pprint(list(scenes)[:5])
Which folders are in each scene? (The last three were added later, during preprocessing.)
sample_scene = (path/'L15-0683E-1006N_2732_4164_13')
pprint(list(sample_scene.ls()))
How many images are in a specific scene:
images_masked = (sample_scene/'images_masked').ls().sorted()
labels = (sample_scene/'labels_match').ls().sorted()
print(f'Number of images in scene: {len(images_masked)}')
pprint(list(images_masked[:5]))
There are 60 scenes of 4km x 4km in the dataset, each containing about 24 images over the span of two years.
Let's pick one example image and its polygons:
image, shapes = images_masked[0], labels[0]
We use the images that have UDM (unusable data mask) overlays applied where clouds obscured the original picture:
show_image(PILImage.create(image), figsize=(12,12));
This is a function to generate binary mask images from geojson vector files. Source
import rasterio
from rasterio.plot import reshape_as_image
import rasterio.mask
from rasterio.features import rasterize
import pandas as pd
import geopandas as gpd
from shapely.geometry import mapping, Point, Polygon
from shapely.ops import cascaded_union
# SOURCE: https://lpsmlgeo.github.io/2019-09-22-binary_mask/
def generate_mask(raster_path, shape_path, output_path=None, file_name=None):
"""Function that generates a binary mask from a vector file (shp or geojson)
raster_path = path to the .tif;
shape_path = path to the shapefile or GeoJson.
output_path = Path to save the binary mask.
file_name = Name of the file.
"""
#load raster
with rasterio.open(raster_path, "r") as src:
raster_img = src.read()
raster_meta = src.meta
#load the shapefile or GeoJSON
train_df = gpd.read_file(shape_path)
#Verify crs
if train_df.crs != src.crs:
print(" Raster crs : {}, Vector crs : {}.\n Convert vector and raster to the same CRS.".format(src.crs,train_df.crs))
#Function that generates the mask
def poly_from_utm(polygon, transform):
poly_pts = []
poly = cascaded_union(polygon)
for i in np.array(poly.exterior.coords):
poly_pts.append(~transform * tuple(i))
new_poly = Polygon(poly_pts)
return new_poly
poly_shp = []
im_size = (src.meta['height'], src.meta['width'])
for num, row in train_df.iterrows():
if row['geometry'].geom_type == 'Polygon':
poly = poly_from_utm(row['geometry'], src.meta['transform'])
poly_shp.append(poly)
else:
for p in row['geometry']:
poly = poly_from_utm(p, src.meta['transform'])
poly_shp.append(poly)
if len(poly_shp) > 0:
mask = rasterize(shapes=poly_shp,
out_shape=im_size)
else:
mask = np.zeros(im_size)
# Save or show mask
mask = mask.astype("uint8")
bin_mask_meta = src.meta.copy()
bin_mask_meta.update({'count': 1})
if (output_path != None and file_name != None):
os.chdir(output_path)
with rasterio.open(file_name, 'w', **bin_mask_meta) as dst:
dst.write(mask * 255, 1)
else:
return mask
Show a mask:
mask = generate_mask(image, shapes)
plt.figure(figsize=(12,12))
plt.tight_layout()
plt.xticks([])
plt.yticks([])
plt.imshow(mask,cmap='cividis');
Now we create and save a mask file for every image in the 'images_masked' folder of every scene.
def save_masks():
for scene in tqdm(path.ls().sorted()):
for img in (scene/'images_masked').ls():
shapes = scene/'labels_match'/(img.name[:-4]+'_Buildings.geojson')
if not os.path.exists(scene/'binary_mask'/img.name):
if not os.path.exists(scene/'binary_mask'):
os.makedirs(scene/'binary_mask')
generate_mask(img, shapes, scene/'binary_mask', img.name)
save_masks()
The mask creation failed on one image for no obvious reason. As we are planning to just use a subset of the data anyway, I simply deleted it from the training set.
Let's look at how the images in a scene change over time:
We can see that the ~24 images of every scene are quite similar. The vegetation changes with the seasons and some scenes show building activity, but overall the similarities are greater than the differences.
Therefore I decided to ignore most images. I originally planned to keep every fifth image of every scene, so for example January, June, November, April, and September. This way we could make use of the variability of the different seasons. But it turned out that just selecting one image per scene yielded similar results with a fraction of the training time.
def get_masked_images(path:Path, n=1)->list:
"Returns the first `n` pictures from every scene"
files = []
for folder in path.ls():
files.extend(get_image_files(path=folder, folders='images_masked')[:n])
return files
masked_images = get_masked_images(path, 1)
len(masked_images)
Our dataset now consists of 60 full images.
Since the images are large (1024x1024), we cut them into 16 smaller tiles (255x255) and save them to disk. Most structures are small in relation to the whole scene, so this should not hurt training too much. Smaller tiles allow for larger batch sizes and/or deeper models to fit in GPU RAM.
Most images have 1024x1024 pixels. Some images however have only 1023 pixels in one dimension, therefore I chose 255 instead of 256 as the tile size. This throws away some pixels in most images, but maintains an equal tile size for all images.
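A quick sanity check of that choice, just the integer arithmetic behind the tile size:
# With a tile size of 255, both 1024 px and 1023 px dimensions yield a full 4x4 grid of tiles;
# with 256, the 1023 px images would only give 3 tiles in that dimension.
for dim in (1024, 1023):
    print(f'{dim} px -> {dim // 255} tiles of 255, {dim // 256} tiles of 256')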
To do: Ideally, we would create overlapping tiles to avoid some buildings being cut in half and never seen in their full shape by the model.
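A rough sketch of what such an overlapping cut could look like. This is not used in this notebook, and the stride value is just illustrative.
# Sketch: cut overlapping tiles by sliding a window with a stride smaller than the tile size,
# so that buildings cut at one tile border appear whole in a neighbouring tile.
def cut_tiles_overlapping(img:np.ndarray, tile_size:int=255, stride:int=200) -> list:
    "Returns a list of (top, left, tile) tuples with ~55 px of overlap between neighbours"
    tiles = []
    h, w = img.shape[:2]
    for top in range(0, h - tile_size + 1, stride):
        for left in range(0, w - tile_size + 1, stride):
            tiles.append((top, left, img[top:top + tile_size, left:left + tile_size]))
    return tiles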
def cut_tiles(tile_size:int):
"Cuts the large images and masks into equal tiles and saves them to disk"
masked_images = get_masked_images(path, 5)
for fn in tqdm(masked_images):
scene = fn.parent.parent
# Create directories
if not os.path.exists(scene/'img_tiles'):
os.makedirs(scene/'img_tiles')
if not os.path.exists(scene/'mask_tiles'):
os.makedirs(scene/'mask_tiles')
# Create mask for current image
img = np.array(PILImage.create(fn))
msk_fn = str(fn).replace('images_masked', 'binary_mask')
msk = np.array(PILMask.create(msk_fn))
x, y, _ = img.shape
# Cut tiles and save them
for i in range(x//tile_size):
for j in range(y//tile_size):
img_tile = img[i*tile_size:(i+1)*tile_size,j*tile_size:(j+1)*tile_size]
msk_tile = msk[i*tile_size:(i+1)*tile_size,j*tile_size:(j+1)*tile_size]
Image.fromarray(img_tile).save(f'{scene}/img_tiles/{fn.name[:-4]}_{i}_{j}.png')
Image.fromarray(msk_tile).save(f'{scene}/mask_tiles/{fn.name[:-4]}_{i}_{j}.png')
TILE_SIZE = 255
cut_tiles(TILE_SIZE)
Little helper functions
def get_image_tiles(path:Path, n_tiles=TILES_PER_SCENE) -> L:
"Returns a list of the first `n` image tile filenames in `path`"
files = L()
for folder in path.ls():
files.extend(get_image_files(path=folder, folders='img_tiles')[:n_tiles])
return files
def get_y_fn(fn:Path) -> str:
"Returns filename of the associated mask tile for a given image tile"
return str(fn).replace('img_tiles', 'mask_tiles')
def get_y(fn:Path) -> PILMask:
"Returns a PILMask object of 0s and 1s for a given tile"
fn = get_y_fn(fn)
msk = np.array(PILMask.create(fn))
msk[msk==255] = 1
return PILMask.create(msk)
Let's look at some raw image tiles and their masks.
def show_tiles(n):
all_tiles = get_image_tiles(path)
subset = random.sample(all_tiles, n)
fig, ax = plt.subplots(n//2, 4, figsize=(14,14))
for i in range(n):
y = i//2
x = 2*i%4
PILImage.create(subset[i]).show(ctx=ax[y, x])
get_y(subset[i]).show(ctx=ax[y, x+1], cmap='cividis')
fig.tight_layout()
plt.show()
show_tiles(8)
Challenges of the dataset
As we can see in the visualizations, the dataset provides some challenges:
- The buildings are often extremely small, just a few pixels, and very close to each other
- On the other hand, there are large structures that cover a much greater area than small buildings
- Some buildings are hard to recognize, even for the human eye
- The density of buildings varies greatly. There are tiles with no buildings at all, other tiles show urban scenes with hundreds of buildings
- The images are very diverse, with great differences in topography, vegetation and urbanization
- Some tiles are covered partially or completely with UDM masks
To explore how imbalanced the data is exactly, we'll analyze the percentages of building pixels in each tile. We create a simple dataloader to easily load and analyze the masks.
tiles = DataBlock(
blocks = (ImageBlock(),MaskBlock(codes=CODES)),
get_items = get_image_tiles,
get_y = get_y
)
dls = tiles.dataloaders(path, bs=BATCH_SIZE)
dls.vocab = CODES
targs = torch.zeros((0,255,255))
for _, masks in dls[0]:
targs = torch.cat((targs, masks.cpu()), dim=0)
targs.shape
We have 786 image tiles in total.
Calculating the percentage of building pixels vs background pixels:
total_pixels = targs.shape[1]**2
percentages = torch.count_nonzero(targs, dim=(1,2))/total_pixels
plt.hist(percentages, bins=20)
plt.ylabel('Number of tiles')
plt.xlabel('Ratio of pixels that are of class `building`')
plt.gca().spines['top'].set_color('none')
plt.gca().spines['right'].set_color('none')
plt.show()
We can see that many tiles contain no or very few buildings.
torch.count_nonzero((percentages==0.).float()).item()
75 tiles do not contain a single pixel of the building class; that's almost 10% of the tiles. These can be areas of empty land or water, or tiles that were covered by clouds.
What is the tile with the largest percentage of buildings?
targs[percentages.argsort(descending=True)[0]].show();
What is the overall ratio building/background?
print(percentages.mean().item(), percentages.median().item())
On average, 6.5% of a tile's pixels are of the building class. The median is only 3.4%. This means this is a rather imbalanced dataset.
To allow the evaluation of the performance of our model, we set aside 15% of the dataset as validation set.
We must be thoughtful about how we create this validation set. Using random images would be too easy, as we have several images per scene that differ only slightly. Our validation set would not be thoroughly separated from the training set.
Instead, I chose to randomly select 9 scenes that are used as validation data. The model will never see any images from these scenes during training.
VALID_SCENES = ['L15-0571E-1075N_2287_3888_13',
'L15-1615E-1205N_6460_3370_13',
'L15-1210E-1025N_4840_4088_13',
'L15-1185E-0935N_4742_4450_13',
'L15-1481E-1119N_5927_3715_13',
'L15-0632E-0892N_2528_4620_13',
'L15-1438E-1134N_5753_3655_13',
'L15-0924E-1108N_3699_3757_13',
'L15-0457E-1135N_1831_3648_13']
def valid_split(item):
scene = item.parent.parent.name
return scene in VALID_SCENES
To help mitigate the class imbalance, we remove all tiles that contain no buildings from the training set. This reduces the number of samples by ~10%, speeding up training while helping the model perform better.
def has_buildings(fn:Path) -> bool:
"""Returns whether the mask of a given image tile
contains at least one pixel of a building"""
fn = get_y_fn(fn)
msk = tensor(PILMask.create(fn))
count = torch.count_nonzero(msk)
return count>0.
def get_undersampled_tiles(path:Path) -> L:
"""Returns a list of image tile filenames in `path`.
For tiles in the training set, empty tiles are ignored.
All tiles in the validation set are included."""
files = get_image_tiles(path)
train_idxs, valid_idxs = FuncSplitter(valid_split)(files)
train_files = L(filter(has_buildings, files[train_idxs]))
valid_files = files[valid_idxs]
return train_files + valid_files
The following transformations seem reasonable for satellite images. We flip the tiles vertically and horizontally, rotate them, change brightness, contrast and saturation by a small amount. We normalize them according to ImageNet stats, so that we can use a pretrained model later.
tfms = [Dihedral(0.5), # Horizontal and vertical flip
Rotate(max_deg=180, p=0.9), # Rotation in any direction possible
Brightness(0.2, p=0.75),
Contrast(0.2),
Saturation(0.2),
Normalize.from_stats(*imagenet_stats)]
To create the datasets, we use the convenient DataBlock API of fastai. We only load 16 tiles per scene, so only 1 image per region.
tiles = DataBlock(
blocks = (ImageBlock(),MaskBlock(codes=CODES)), # Independent variable is Image, dependent variable is Mask
get_items = get_undersampled_tiles, # Collect undersampled tiles
get_y = get_y, # Get dependent variable: mask
splitter = FuncSplitter(valid_split), # Split into training and validation set
batch_tfms = tfms # Transforms on GPU: augmentation, normalization
)
dls = tiles.dataloaders(path, bs=BATCH_SIZE)
dls.vocab = CODES
len(dls.train_ds), len(dls.valid_ds)
We have 741 tiles in the training set and 144 tiles in the validation set.
Making sure the batches look okay:
inputs, targets = dls.one_batch()
inputs.shape, targets.shape
These dimensions are as expected:
- 12 images per batch
- 3 channels for the input images
- no color channels for the target mask
- image size: 255x255.
Check that the mask looks as expected, 0s and 1s:
targets[0].unique()
The task at hand is an image segmentation problem. In the original competition, it is required to assign individual labels to each building to keep track of it over time (instance segmentation). Here, I chose to do semantic segmentation instead: simply classifying, for every pixel, whether it belongs to a building or not.
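To make the distinction concrete, here is a small sketch (assuming scipy is available; it reuses the get_image_tiles and get_y_fn helpers defined above) of how a semantic mask could later be turned into rough instance labels via connected components. Touching buildings would be merged into one blob, which is exactly what makes the original challenge hard.
from scipy import ndimage
# Sketch: label connected components of a binary building mask to get rough "instances".
tile_fn = get_image_tiles(path)[0]
semantic_mask = np.array(PILMask.create(get_y_fn(tile_fn))) > 0
instances, n_buildings = ndimage.label(semantic_mask)  # each blob gets its own integer id
print(f'{n_buildings} connected building blobs in this tile')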
The fastai library allows the remarkably simple creation of a U-Net, a standard architecture for image segmentation problems. The DynamicUnet module, given an encoder architecture, automatically constructs the decoder and cross connections. This makes it easy to build a U-Net out of different (and pretrained) architectures. I chose this approach to have more time to experiment instead of writing code from scratch. I considered the following aspects:
- Encoder: I picked an xResNet34 model that has been pretrained on ImageNet. A 34-layer encoder seems like a good compromise between accuracy and memory/compute requirements.
- Loss function: The choice of the loss function is important for segmentation problems. I'll use a weighted pixel-wise cross-entropy loss. The weights are important for the imbalanced dataset.
- Optimizer: I use the default optimizer, Adam.
- Metrics:
- As the classes are very imbalanced, a simple accuracy metric would not be helpful. In a picture with 3% buildings, the model could predict "no building" for every pixel and still get 97% accuracy.
- Instead, I focus on the Dice metric, which is often used for segmentation tasks. It is equivalent to the F1 score and measures $\frac{2TP}{2TP + FP + FN}$ (a short sketch follows this list).
- Additionally, I track fastai's foreground_acc, which measures the percentage of foreground pixels classified correctly (the recall). Foreground in this case is the building class.
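To make the Dice formula concrete, here is a minimal toy example. This is not fastai's implementation; the built-in Dice() metric used below accumulates TP/FP/FN over the whole validation set.
# Minimal illustration of Dice = 2*TP / (2*TP + FP + FN) for binary masks.
import torch

def dice(pred:torch.Tensor, targ:torch.Tensor) -> float:
    "Dice score for masks containing 0s and 1s"
    tp = ((pred == 1) & (targ == 1)).sum().item()
    fp = ((pred == 1) & (targ == 0)).sum().item()
    fn = ((pred == 0) & (targ == 1)).sum().item()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 1.0

# A prediction that finds one of two building pixels and adds one false positive:
pred = torch.tensor([[1, 0], [1, 0]])
targ = torch.tensor([[1, 1], [0, 0]])
print(dice(pred, targ))  # 2*1 / (2*1 + 1 + 1) = 0.5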
weights = Tensor(CLASS_WEIGHTS).cuda()
loss_func = CrossEntropyLossFlat(axis=1, weight=weights)
With some experimentation, the class weights 0.25 for the background and 0.75 for the building class seem to work fine.
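As a point of comparison, the weights could also be derived from the pixel statistics computed earlier, e.g. inverse class frequency normalized to sum to 1. The sketch below is not what is used for training; it just shows that a purely statistical choice would be much more aggressive than the hand-picked values.
# Sketch: derive class weights from the mean building-pixel ratio (~6.5% measured above)
# via inverse frequency. Shown only for comparison with the hand-picked [0.25, 0.75].
building_ratio = percentages.mean().item()  # ~0.065, computed earlier
freqs = torch.tensor([1 - building_ratio, building_ratio])
alt_weights = (1 / freqs) / (1 / freqs).sum()  # ~[0.07, 0.93]
print(alt_weights)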
learn = unet_learner(dls, # DataLoaders
ARCHITECTURE, # xResNet34
loss_func = loss_func, # Weighted cross entropy loss
opt_func = Adam, # Adam optimizer
metrics = [Dice(), foreground_acc], # Custom metrics
self_attention = False,
cbs = [SaveModelCallback(
monitor='dice',
comp=np.greater,
fname='best-model'
)]
)
Summary of the model:
learn.summary()
This is the full UNet model:
learn.model
We can use fastai's learning rate finder to pick a reasonable learning rate:
learn.lr_find(suggestions=False)
Somewhere around 1e-4 seems reasonable, where the loss decreases steadily.
lr_max = LR_MAX # 3e-4
We unfreeze the model to train the encoder and decoder simultaneously. The encoder should be trained at a lower learning rate, since we don't want to change the pretrained features too much. This is achieved by setting the learning rate to slice(lr_max/10, lr_max).
We use the fit_one_cycle method, where the learning rate starts low for a warm-up period, rises to a maximum of lr_max, and then anneals towards 0 at the end of training.
learn.unfreeze()
learn.fit_one_cycle(
EPOCHS,
lr_max=slice(lr_max/ENCODER_FACTOR, lr_max),
cbs=[WandbCallback()]
)
The best model, at epoch 34, has a Dice score of 0.572. The theoretical maximum, a perfect segmentation, is 1.0. We'll see what the results look like in the following visualizations.
learn.recorder.plot_loss()
learn.load('best-model');
probs,targets,preds,losses = learn.get_preds(dl=dls.valid,
with_loss=True,
with_decoded=True,
act=None)
Sort descending by loss:
loss_sorted = torch.argsort(losses, descending=True)
n = len(loss_sorted)
Helper function to show predictions:
def show_single_pred(index:int):
fig, ax = plt.subplots(1, 4, figsize=(20,5))
dls.valid_ds[index][0].show(ctx=ax[0]);
ax[0].set_title("Input")
show_at(dls.valid_ds, index, cmap='Blues', ctx=ax[1]);
ax[1].set_title("Target")
preds[index].show(cmap='Blues', ctx=ax[2]);
ax[2].set_title("Prediction Mask")
probs[index][1].show(cmap='viridis', ctx=ax[3]);
ax[3].set_title("Building class probability")
for idx in loss_sorted[:3]:
print(f'Tile #{idx}, loss: {losses[idx]}')
show_single_pred(idx)
All images with the highest losses show dense urban areas. What's noticeable is the trouble that the model has with large buildings, which are often completely overlooked. It's also clear that very small buildings are merged into "blobs". I suspect tracking individual buildings could be difficult here.
for idx in loss_sorted[n//2-1:n//2+2]:
print(f'Tile #{idx}, loss: {losses[idx]}')
show_single_pred(idx)
The model tends to merge small buildings into larger blobs, and there are several false positives. But there are also some quite good predictions, picking up buildings that are hard even for the human eye to pick up.
for idx in loss_sorted[-21:-18]:
print(f'Tile #{idx}, loss: {losses[idx]}')
show_single_pred(idx)
The model shows mixed performance on images with few buildings in them. Overall, the accuracy looks better here than in dense areas. But the model tends to produce false positives, and some tiles show strange artifacts in the corners. It seems as if the model interprets the corners themselves as buildings, especially on tiles covered with water.
Predict all tiles of a scene:
def save_predictions(scene, path=path) -> None:
"Predicts all 16 tiles of one scene and saves them to disk"
output_folder = path/scene/'predicted_tiles'
if not os.path.exists(output_folder):
os.makedirs(output_folder)
tiles = get_image_files(path/scene/'img_tiles').sorted()
for i in range(16):
tile_preds = learn.predict(tiles[i])
to_image(tile_preds[2][1].repeat(3,1,1)).save(output_folder/f'{i:02d}.png')
scene = VALID_SCENES[0] # 'L15-0571E-1075N_2287_3888_13'
save_predictions(scene)
Helper function to show several tiles as a large image:
def unblockshaped(arr, h, w):
"""
Return an array of shape (h, w) where
h * w = arr.size
If arr is of shape (n, nrows, ncols), n subblocks of shape (nrows, ncols),
then the returned array preserves the "physical" layout of the subblocks.
Source: https://stackoverflow.com/a/16873755
"""
try: # with color channel
n, nrows, ncols, c = arr.shape
return (arr.reshape(h//nrows, -1, nrows, ncols, c)
.swapaxes(1,2)
.reshape(h, w, c))
except ValueError: # without color channel
n, nrows, ncols = arr.shape
return (arr.reshape(h//nrows, -1, nrows, ncols)
.swapaxes(1,2)
.reshape(h, w))
Load saved predictions:
def get_saved_preds(scene, path=path):
"Load saved prediction mask tiles for a scene and return image + assembled mask"
image_file = (path/scene/'images_masked').ls()[0]
image = load_image(image_file)
mask_tiles = get_image_files(path/scene/'predicted_tiles').sorted()
mask_arrs = np.array(list(maps(partial(load_image, mode="L"), np.asarray, mask_tiles)))
mask_array = unblockshaped(np.array(mask_arrs), 1020, 1020)
return (image, mask_array)
Show image + stitched predictions:
def show_complete_preds(image, mask_array, scene):
"Source: https://github.com/CosmiQ/CosmiQ_SN7_Baseline/blob/master/notebooks/sn7_baseline.ipynb"
figsize = (25, 16)
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=figsize)
_ = ax0.imshow(image)
ax0.set_xticks([])
ax0.set_yticks([])
ax0.set_title('Image')
_ = ax1.imshow(mask_array, cmap='viridis')
ax1.set_xticks([])
ax1.set_yticks([])
ax1.set_title('Prediction Mask')
plt.suptitle(scene)
plt.tight_layout()
plt.savefig(os.path.join(path, scene + '_im0+mask0+dice575.png'))
plt.show()
show_complete_preds(*get_saved_preds(scene), scene)
Discussion
I'm not sure how to rate the results because I don't have a point of comparison. A Dice score of 0.57 doesn't sound great. But considering how difficult the dataset is and that I didn't customize the architecture at all, I'm quite pleased with the result. There is a lot to improve, however! The original SpaceNet7 challenge required recognizing individual buildings and tracking them through time; that's something I'd like to work on in the future.
What worked?
- Using a pretrained encoder.
- Ignoring most images of the dataset. I tried using 5 instead of 1 images per scene, which increased training time roughly fivefold but did not improve the results significantly.
- Standard data augmentations. Without them, the model started to overfit sooner.
- Undersampling. While it did not have a large effect, it sped up training a little and helped the accuracy.
- Weighted cross-entropy loss. Without the weights, the model had a strong bias towards the dominating background class and failed to recognize many buildings.
What didn't?
- I hoped to get faster training with the Mish activation function, but training was unstable.
- Dice loss instead of cross-entropy loss was unstable as well.
- I tried adding self-attention to the U-Net, hoping it would help with classifying larger structures, but I did not notice a significant difference.
- A deeper xResNet50 encoder increased training time 6-fold but did not improve the results.
Other ideas to improve the results
- Better data processing: using overlapping tiles, scaling up the image tiles
- Dynamic thresholding for turning the predicted probabilities into a binary mask (a rough sketch follows this list).
- Implement recent advancements in segmentation models, e.g. U-Net with ASPP or Eff-UNet
- More compute: Deeper models. Use cross-validation with several folds to utilize all 60 scenes. Ensemble different models.
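A rough sketch of the dynamic-thresholding idea, using the probs and targets tensors returned by learn.get_preds above. This is illustrative only and uses a simple per-pixel Dice over the whole validation set rather than fastai's metric.
# Sketch: instead of taking the argmax over the two classes, search for the probability
# cutoff that maximizes Dice on the validation predictions.
import torch

def pixel_dice(pred:torch.Tensor, targ:torch.Tensor, eps:float=1e-8) -> float:
    "Dice = 2*TP / (2*TP + FP + FN) over all pixels"
    tp = (pred * targ).sum()
    return float(2 * tp / (pred.sum() + targ.sum() + eps))

building_probs = probs[:, 1]  # channel 1 = building class probability
cutoffs = torch.linspace(0.1, 0.9, 17)
scores = [pixel_dice((building_probs > t).float(), targets.float()) for t in cutoffs]
best = int(torch.tensor(scores).argmax())
print(f'best threshold: {cutoffs[best]:.2f}, Dice: {scores[best]:.3f}')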
Thank you for reading this far! The challenge was fun and I learned a lot. There is also a lot of room for improvement and work to do :)