Weather Patterns Using Hadoop and R – Part 2

The first part of this post looked at data mining of world weather patterns such as cloud location and cloud coverage levels.

As these features describe aggregate values and ratios, they are quite simple to evaluate with Hadoop, R and MapReduce Streaming. However, for time-dependent analytics, such as how long rainfall lasts, additional issues need to be addressed.

Time Dependent Data Mining

Time-based weather analytics focuses on how the weather at a given location changes over time. This is done by extracting data using R so that Hadoop sorts it in the shuffle phase by weather station, then by time. However, this is quite inefficient because, for any station, data is scattered across every file in the dataset.

Filtering down to 6 stations spread across the globe significantly reduces the data load for Hadoop’s shuffle phase, while there remains sufficient data to look at world weather and regional patterns.

The 6 stations, grouped by latitude band, are:

  • Arctic and Antarctic – Danmarkshavn, Greenland and Base Belgrano II, an Argentinian-run Antarctic weather station.
  • Mid Latitudes – Dublin Airport, Ireland and Christchurch, New Zealand.
  • Tropics – Khartoum, Sudan and La Paz, Bolivia.

The following R script brings this all together to form the Mapper:

#!/usr/bin/env Rscript
# Mapper script
stations <- c(" 4320", #Danmarkshavn
              " 3969", #Dublin Airport
              "62721", #Khartoum
              "85201", #La Paz
              "93890", #Christchurch
              "89674") #Base Belgrano II

#days in each month of a standard 365 day year
months <- c(31,28,31,30,31,30,31,31,30,31,30,31)
#days in each year from 1981 to 1991, inclusive
years <- c(365,365,365,366,365,365,365,366,365,365,365)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0)
{
  #keep only records from the chosen stations
  if (substr(line,20,24) %in% stations)
  {
    #February has 29 days in the leap years 1984 and 1988
    months[2] <- if (substr(line,1,2) %in% c("84","88")) 29 else 28

    #fixed width fields, so sorting whole lines orders by station then time
    cat(sprintf("%s%5d%s%d\n",
      #land station synoptic code
      substr(line,20,24),
      #time in number of hours since Jan 1, 1981 00:00
      sum(years[81:91 < as.integer(substr(line,1,2))])*24
      + sum(months[1:12 < as.integer(substr(line,3,4))])*24
      + (as.integer(substr(line,5,6))-1)*24
      + as.integer(substr(line,7,8)),
      #characters 33 to 38 of the record, passed through unchanged
      substr(line,33,38),
      #precipitation level
      as.integer(as.numeric(substr(line,26,27))/10)
    ))
  }
}
close(con)
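
The time key computed in the mapper can be checked in isolation. The following is a minimal sketch that repeats the same arithmetic in a standalone function and compares it with R's own date arithmetic. The function name hours_since_1981 and the variable names are assumptions made for illustration and do not appear in the mapper.

```r
# Days in each month of a standard year and in each year 1981..1991,
# with the same values as in the mapper.
month_days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
year_days  <- c(365, 365, 365, 366, 365, 365, 365, 366, 365, 365, 365)

# hours_since_1981 is a hypothetical helper name, not part of the mapper.
# yy is the two digit year (81..91), mm the month, dd the day, hh the hour.
hours_since_1981 <- function(yy, mm, dd, hh) {
  md <- month_days
  if (yy %in% c(84, 88)) md[2] <- 29   # leap years within the range
  sum(year_days[81:91 < yy]) * 24 +    # whole years before yy
    sum(md[1:12 < mm]) * 24 +          # whole months before mm
    (dd - 1) * 24 +                    # whole days before dd
    hh                                 # hours into the day
}

# Cross-check against R's date arithmetic for 1 March 1984, 12:00.
stamp    <- as.POSIXct("1984-03-01 12:00", tz = "UTC")
origin   <- as.POSIXct("1981-01-01 00:00", tz = "UTC")
expected <- as.numeric(difftime(stamp, origin, units = "hours"))
print(hours_since_1981(84, 3, 1, 12) == expected)
```

Because the largest value in the 11 year range is below 100,000, the %5d format in the mapper always yields exactly 5 characters, which keeps the lexical sort order equal to the time order.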

Running the mapper script with an identity reducer produces output, sorted by station and then by time, that can be mined in R outside Hadoop. Note that a map-only job (achieved with the command line argument -D mapred.reduce.tasks=0) won’t shuffle the data, so an identity reducer is necessary for this task.
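
As a rough sketch of the step outside Hadoop, the fixed-width lines written by the mapper can be split by character position and the hour count turned back into a timestamp. The two sample records are made up, the "xxxxxx" placeholder stands in for the six pass-through characters, and the column names are assumptions chosen for illustration.

```r
# Two made-up records in the mapper's "%s%5d%s%d" layout.
lines <- c(" 3969    6xxxxxx0",
           " 396927732xxxxxx2")

# The first three fields are fixed width (5, 5 and 6 characters); the
# precipitation level is whatever follows character 16. trimws guards
# against trailing whitespace, such as a tab separator, in the job output.
obs <- data.frame(
  station = substr(lines, 1, 5),
  hours   = as.integer(substr(lines, 6, 10)),
  fields  = substr(lines, 11, 16),
  precip  = as.integer(trimws(substr(lines, 17, nchar(lines)))),
  stringsAsFactors = FALSE
)

# Convert hours since Jan 1, 1981 00:00 back to a timestamp.
origin   <- as.POSIXct("1981-01-01 00:00", tz = "UTC")
obs$time <- origin + obs$hours * 3600

print(obs)
```

With real job output, the lines vector would instead come from readLines on the downloaded part files.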

Results

When Precipitation Happens

As noted in the previous post, nimbostratus always occurs with precipitation. For stratus, 45% of occurrences are followed by precipitation before either the stratus disappears or 24 hours have passed. With other cloud types this rate ranges from 5% to 26%.
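
One way a rate like this could be computed is sketched below. It treats each unbroken run of stratus reports as one occurrence and asks whether precipitation is reported while the run lasts and within 24 hours of its start. The data frame is made up, and the logical columns stratus and precip are assumptions; missing reports are not handled.

```r
# Made-up observations for one station, already sorted by time.
obs <- data.frame(
  hours   = c(0, 3, 6, 9, 12, 15, 18, 21, 24),
  stratus = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE),
  precip  = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Label each unbroken run of equal stratus values with a run number.
runs    <- rle(obs$stratus)
episode <- rep(seq_along(runs$lengths), runs$lengths)

# For each stratus run, test whether precipitation is reported while the
# stratus persists and no later than 24 hours after its first report.
stratus_obs <- obs[obs$stratus, ]
followed <- sapply(split(stratus_obs, episode[obs$stratus]), function(ep) {
  any(ep$precip & ep$hours - ep$hours[1] <= 24)
})

# Share of stratus occurrences followed by precipitation.
rate <- mean(followed)
print(rate)
```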

As mentioned before, high-level cloud types do not induce precipitation, so any data mining results relating these cloud types to precipitation are meaningless[1].

How Long Precipitation Lasts

At the 6 chosen stations, nimbostratus precipitation lasts for up to 6 days, corresponding with the expectation that precipitation “is generally steady and prolonged”[1]. With stratus, precipitation can also last for up to 6 days, and with cirrostratus for up to 4.5 days, while for other cloud types the maximum is 3 days.
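
The longest spell of precipitation can be found with run-length encoding. This is a minimal sketch on made-up data for a single station, assuming a vector of observation times in hours and a logical precipitation flag; both names are hypothetical.

```r
# Made-up, time-ordered reports for one station.
hours  <- c(0, 3, 6, 9, 12, 15, 18, 21, 24, 27)
precip <- c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)

# Positions of the first and last report in each run of equal values.
runs   <- rle(precip)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Length of each wet spell, measured from its first to its last wet report.
spell_hours <- (hours[ends] - hours[starts])[runs$values]

# Longest spell, expressed in days.
longest_days <- max(spell_hours) / 24
print(longest_days)
```

Measuring from the first to the last wet report gives a lower bound on the true duration, since precipitation may start before or end after those reports.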

Cloud Development Patterns

Nimbostratus – follows altocumulus slightly more often than altostratus. It may also form after cirrus cloud cover, and is roughly as likely to develop into altocumulus as into altostratus.

Cirrus – at the 6 stations often develops into cirrostratus. Cirrostratus is known to form from “spreading and joining of cirrus clouds”[1].

Cumulus – at the chosen stations usually develops into stratocumulus or cumulonimbus. Indeed, cumulonimbus is known to form from “upwardly mobile cumulus congestus clouds (thermals)”[1].

Stratocumulus – 31% of occurrences are followed by altocumulus. Indeed, altocumulus can form “by subdivision of a layer of stratocumulus”[2]. 27% of stratocumulus occurrences are followed by cumulus. Indeed, “cumulus may originate from … stratocumulus”[3]. Also, 33% of stratocumulus occurrences are followed by stratus. This is in line with the expectation that “stratus may develop from stratocumulus”[4].

Stratus – is more likely to develop into stratocumulus than to clear to no low cloud cover.
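
The “followed by” percentages in this section are transition frequencies between consecutive cloud types. One way they could be tabulated is sketched below, assuming a time-ordered vector of cloud type labels for one station; the sample sequence is made up.

```r
# Made-up sequence of reported cloud types at one station, in time order.
cloud <- c("stratocumulus", "stratocumulus", "stratus", "stratocumulus",
           "cumulus", "cumulus", "stratocumulus", "stratus", "stratus",
           "stratocumulus")

# Collapse repeated reports so each occurrence of a type is counted once.
changed    <- c(TRUE, head(cloud, -1) != tail(cloud, -1))
occurrence <- cloud[changed]

# Pair each occurrence with the one that follows it and count the pairs.
from <- head(occurrence, -1)
to   <- tail(occurrence, -1)
transitions <- table(from, to)

# Row-wise proportions: the share of each type's occurrences that are
# followed by each other type.
followed_by <- prop.table(transitions, margin = 1)
print(round(followed_by, 2))
```

In this made-up sequence two of the three stratocumulus occurrences are followed by stratus, so the stratocumulus row shows 0.67 for stratus and 0.33 for cumulus.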

Conclusion

Again, data mining results compare quite well with established cloud facts and world weather satellite data, although this part of the study restricts itself to 6 land-based locations.

The set of stations can be changed so that the trade-off between accuracy and time running Hadoop is optimal while various data analytics, such as regional weather effects, can be extracted. The main drawback is that this data mining involves re-running the MapReduce job each time the station list is changed.

References

  1. Common Cloud Names, Shapes, and Altitudes
  2. Altocumulus
  3. Cumulus
  4. Stratus
