1.1 Learning Outcome
By the end of this hands-on exercise, you will be able to use R packages to:
plot a calendar heatmap using ggplot2 functions,
plot a cycle plot using ggplot2 functions,
plot a slopegraph, and
plot a horizon chart.
1.2 Getting Started
Write a code chunk to check, install and launch the following R packages: scales, viridis, lubridate, ggthemes, gridExtra, readxl, knitr, data.table, tidyverse and CGPfunctions.
required_packages <- c("scales", "viridis", "lubridate", "ggthemes", "gridExtra",
                       "readxl", "knitr", "data.table", "tidyverse", "CGPfunctions")
installed <- rownames(installed.packages())
to_install <- setdiff(required_packages, installed)
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
lapply(required_packages, library, character.only = TRUE)
Loading required package: viridisLite
Attaching package: 'viridis'
The following object is masked from 'package:scales':
viridis_pal
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ purrr 1.0.4 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between() masks data.table::between()
✖ readr::col_factor() masks scales::col_factor()
✖ dplyr::combine() masks gridExtra::combine()
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::first() masks data.table::first()
✖ data.table::hour() masks lubridate::hour()
✖ data.table::isoweek() masks lubridate::isoweek()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::last() masks data.table::last()
✖ data.table::mday() masks lubridate::mday()
✖ data.table::minute() masks lubridate::minute()
✖ data.table::month() masks lubridate::month()
✖ data.table::quarter() masks lubridate::quarter()
✖ data.table::second() masks lubridate::second()
✖ purrr::transpose() masks data.table::transpose()
✖ data.table::wday() masks lubridate::wday()
✖ data.table::week() masks lubridate::week()
✖ data.table::yday() masks lubridate::yday()
✖ data.table::year() masks lubridate::year()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
1.3 Plotting Calendar Heatmap
1.3.1 The Data
For this hands-on exercise, the eventlog.csv file will be used. This data file consists of 199,999 rows of time-series cyber attack records by country.
1.3.2 Importing the data
attacks <- read_csv("/Users/sharon/OneDrive - Singapore Management University/isss608data/hands-on_exercise6/eventlog.csv")
Rows: 199999 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): source_country, tz
dttm (1): timestamp
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
1.3.3 Examining the data structure
timestamp            source_country  tz
2015-03-12 15:59:16  CN              Asia/Shanghai
2015-03-12 16:00:48  FR              Europe/Paris
2015-03-12 16:02:26  CN              Asia/Shanghai
2015-03-12 16:02:38  US              America/Chicago
2015-03-12 16:03:22  CN              Asia/Shanghai
2015-03-12 16:03:45  CN              Asia/Shanghai
1.3.4 Data Preparation
# Derive the weekday and hour of each attack for one time-zone group
make_hr_wkday <- function(ts, sc, tz) {
  # All rows in a group share one time zone, so tz[1] is enough
  real_times <- ymd_hms(ts, tz = tz[1], quiet = TRUE)
  dt <- data.table(source_country = sc,
                   wkday = weekdays(real_times),
                   hour = hour(real_times))
  return(dt)
}

# Reversed order, so that Sunday is drawn at the top of the y-axis
wkday_levels <- c('Saturday', 'Friday', 'Thursday', 'Wednesday',
                  'Tuesday', 'Monday', 'Sunday')

attacks <- attacks %>%
  group_by(tz) %>%
  do(make_hr_wkday(.$timestamp, .$source_country, .$tz)) %>%
  ungroup() %>%
  # Ordered factors keep the axis order fixed in the heatmap
  mutate(wkday = factor(wkday, levels = wkday_levels),
         hour = factor(hour, levels = 0:23))

kable(head(attacks))
tz            source_country  wkday     hour
Africa/Cairo  BG              Saturday  20
Africa/Cairo  TW              Sunday    6
Africa/Cairo  TW              Sunday    8
Africa/Cairo  CN              Sunday    11
Africa/Cairo  US              Sunday    15
Africa/Cairo  CA              Monday    11
1.3.5 Building the Calendar Heatmaps
# Count attacks for every weekday-hour combination
grouped <- attacks %>%
  count(wkday, hour) %>%
  ungroup() %>%
  na.omit()

ggplot(grouped, aes(hour, wkday, fill = n)) +
  # linewidth replaces size for tile borders (size is deprecated since ggplot2 3.4.0)
  geom_tile(color = "white", linewidth = 0.1) +
  theme_tufte(base_family = "Helvetica") +
  coord_equal() +
  scale_fill_gradient(name = "# of attacks", low = "sky blue", high = "dark blue") +
  labs(x = NULL, y = NULL, title = "Attacks by weekday and time of day") +
  theme(axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6))
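To make the aggregation step easier to inspect, here is a minimal sketch of what count() produces before it is passed to geom_tile(). The toy_attacks data frame and its values are hypothetical stand-ins for the prepared attacks table; only the wkday and hour column names are taken from the exercise.

```r
library(dplyr)

# Hypothetical toy records standing in for the prepared `attacks` table
toy_attacks <- tibble(
  wkday = factor(c("Monday", "Monday", "Monday", "Tuesday"),
                 levels = c("Tuesday", "Monday")),
  hour  = factor(c(9, 9, 10, 9), levels = 0:23)
)

# One row per observed (wkday, hour) pair; `n` is the number of records
toy_grouped <- toy_attacks %>%
  count(wkday, hour)

print(toy_grouped)
```

Each row of the result corresponds to one tile in the heatmap, and n is the value mapped to the fill colour. Combinations that never occur are dropped by default, so they appear as blank cells in the plot.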
1.3.6 Building Multiple Calendar Heatmaps
1.3.7 Plotting Multiple Calendar Heatmaps
# Rank source countries by number of attacks
attacks_by_country <- count(attacks, source_country) %>%
  mutate(percent = percent(n / sum(n))) %>%
  arrange(desc(n))

# Keep only the four countries with the most attacks
top4 <- attacks_by_country$source_country[1:4]
top4_attacks <- attacks %>%
  filter(source_country %in% top4) %>%
  count(source_country, wkday, hour) %>%
  ungroup() %>%
  mutate(source_country = factor(source_country, levels = top4)) %>%
  na.omit()
ggplot(top4_attacks, aes(hour, wkday, fill = n)) +
  # linewidth replaces size for tile borders (size is deprecated since ggplot2 3.4.0)
  geom_tile(color = "white", linewidth = 0.1) +
  theme_tufte(base_family = "Helvetica") +
  coord_equal() +
  scale_fill_gradient(name = "# of attacks", low = "sky blue", high = "dark blue") +
  facet_wrap(~ source_country, ncol = 2) +
  labs(x = NULL, y = NULL, title = "Attacks on top 4 countries by weekday and time of day") +
  theme(axis.ticks = element_blank(),
        axis.text.x = element_text(size = 7),
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6))
1.4 Plotting Cycle Plot
air <- read_excel("/Users/sharon/OneDrive - Singapore Management University/isss608data/hands-on_exercise6/arrivals_by_air.xlsx")

# Derive month and year fields from the Month-Year column
air$month <- factor(month(air$`Month-Year`), levels = 1:12, labels = month.abb, ordered = TRUE)
air$year <- year(ymd(air$`Month-Year`))

# Extract the Vietnam series from 2010 onwards
Vietnam <- air %>%
  select(`Vietnam`, month, year) %>%
  filter(year >= 2010)

# Average arrivals per month, used as the reference line in each panel
hline.data <- Vietnam %>%
  group_by(month) %>%
  summarise(avgvalue = mean(`Vietnam`, na.rm = TRUE))
ggplot() +
  geom_line(data = Vietnam,
            aes(x = year, y = `Vietnam`, group = month), colour = "black") +
  geom_hline(aes(yintercept = avgvalue), data = hline.data, linetype = 6, colour = "red", linewidth = 0.5) +
  facet_grid(~ month) +
  labs(title = "Visitor arrivals from Vietnam by air, Jan 2010–Dec 2019") +
  xlab("") +
  ylab("No. of Visitors") +
  theme_tufte(base_family = "Helvetica") +
  # axis.text.x is a theme element, so it is set in theme() rather than labs()
  theme(axis.text.x = element_blank())
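The reference lines in a cycle plot are simply per-month averages across years. The following is a minimal sketch of that step on a hypothetical toy table; toy_arrivals, its visitors column and all values are made up for illustration and are not from arrivals_by_air.xlsx.

```r
library(dplyr)

# Hypothetical toy series: two months observed over three years
toy_arrivals <- tibble(
  year     = rep(2010:2012, each = 2),
  month    = factor(rep(c("Jan", "Feb"), times = 3), levels = month.abb),
  visitors = c(100, 80, 120, 90, 140, 100)
)

# One average per month, analogous to hline.data in the exercise
toy_hline <- toy_arrivals %>%
  group_by(month) %>%
  summarise(avgvalue = mean(visitors, na.rm = TRUE))

print(toy_hline)
```

Because the averages are grouped by the same month variable used in facet_grid(), each panel receives its own horizontal line.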
1.5 Plotting Slopegraph
rice <- read_csv("/Users/sharon/OneDrive - Singapore Management University/isss608data/hands-on_exercise6/rice.csv")
Rows: 550 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Country
dbl (3): Year, Yield, Production
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rice %>%
  mutate(Year = factor(Year)) %>%
  filter(Year %in% c(1961, 1980)) %>%
  newggslopegraph(Year, Yield, Country,
                  Title = "Rice Yield of Top 11 Asian Countries",
                  SubTitle = "1961–1980",
                  Caption = "Prepared by: Dr. Kam Tin Seong")
Converting 'Year' to an ordered factor
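A slopegraph compares exactly two time points per group, so it can help to check the filtered input before plotting. This is a minimal sketch on hypothetical data; toy_rice, its country labels and its yields are invented, and only the Country, Year and Yield column names mirror rice.csv.

```r
library(dplyr)

# Hypothetical toy data; country names and yields are made up for illustration
toy_rice <- tibble(
  Country = rep(c("A", "B"), each = 3),
  Year    = rep(c(1961, 1970, 1980), times = 2),
  Yield   = c(10, 12, 15, 20, 19, 18)
)

# Same preparation as in the exercise: factor Year, then keep the two end years
slope_input <- toy_rice %>%
  mutate(Year = factor(Year)) %>%
  filter(Year %in% c(1961, 1980))

# Each country should contribute exactly two points, one per year
points_per_country <- count(slope_input, Country)
print(points_per_country)
```

If a country shows fewer than two rows here, its line would be incomplete in the slopegraph, which is worth resolving before calling newggslopegraph().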