Overcoming data limitation challenges in predicting tropical storm surge with interpretable machine learning methods




Stanton, Carly





The impacts of climate change have increased the risk of storm surge flooding in coastal areas. Tropical islands are especially vulnerable to the effects of sea level rise and to the increasing frequency and intensity of tropical cyclones (TCs). Storm surge is typically predicted using a combination of numerical forecasting models, synoptic forecasting, and statistical methods. Machine learning techniques, particularly convolutional neural networks (CNNs), have shown promise in accurately predicting storm surge levels in the short term. However, deep learning methods are computationally expensive and require large amounts of training data, so researchers often must train neural network models on synthetic data generated by numerical models. The goal of this work is to study the effectiveness of simpler, interpretable models, including random forest (RF) regression, multiple linear regression (MLR), and support vector machine regression (SVR), in predicting storm surge in San Juan Bay, Puerto Rico, using limited local meteorological and tidal data and hurricane reanalysis data from actual storm events over the last few decades. These algorithms were used to predict surge at five lead times ranging from one hour to 24 hours and were trained on three different feature sets with two different types of training data windows. Models were trained using a leave-one-out cross-validation (LOOCV) approach, in which the data for one TC was held out from each model as a validation dataset. The performance of the models and training methods was compared in terms of root mean square error (RMSE), normalized RMSE, and error at peak surge. An RF model trained on data from only eight TCs predicted the peak surge of Hurricane Irma to within about 0.03 m, and the time of peak surge to within three hours, at lead times of up to 12 hours, provided that one extreme TC event, in this case Hurricane Maria, was included in the training data. However, all models failed to accurately predict surge for Hurricane Maria, even when other high-surge storms were included in the training data. Other training methods achieved lower RMSE when validated against a peak surge window spanning 12 hours before to 12 hours after peak surge, but could not approach the accuracy of the RF model in predicting the time of peak surge.
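The evaluation described in the abstract can be illustrated with a minimal sketch. This is not the thesis code. The function names, the storm list, and the toy surge values are all hypothetical, and the abstract does not say how RMSE was normalized, so division by the observed range is assumed here. The sketch shows a leave-one-storm-out split together with RMSE, normalized RMSE, and the error in the height and timing of peak surge.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length series."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def normalized_rmse(observed, predicted):
    """RMSE divided by the observed range (an assumed normalization)."""
    return rmse(observed, predicted) / (max(observed) - min(observed))


def peak_errors(observed, predicted):
    """Return (height error, timing error in time steps) at peak surge."""
    obs_idx = max(range(len(observed)), key=lambda i: observed[i])
    pred_idx = max(range(len(predicted)), key=lambda i: predicted[i])
    return predicted[pred_idx] - observed[obs_idx], pred_idx - obs_idx


def leave_one_storm_out(storm_names):
    """Yield (training storms, held-out storm), one split per storm."""
    for held_out in storm_names:
        yield [s for s in storm_names if s != held_out], held_out


# Hypothetical hourly surge values in metres, for illustration only.
observed = [0.1, 0.3, 0.6, 0.4, 0.2]
predicted = [0.1, 0.2, 0.4, 0.5, 0.2]

print("RMSE:", rmse(observed, predicted))
print("NRMSE:", normalized_rmse(observed, predicted))
print("Peak errors:", peak_errors(observed, predicted))

# Placeholder storm list; the real study used more events.
for training, held_out in leave_one_storm_out(["Irma", "Maria", "Storm C"]):
    print("validate on", held_out, "after training on", training)
```

In a full workflow, each split would fit one regressor (RF, MLR, or SVR) on the training storms and score it on the held-out storm with these metrics.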


A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science


machine learning, predictive analytics, random forests, storm surge, tropical cyclone