-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathTSP_MonteCarlo_report_for_md.Rmd
270 lines (203 loc) · 14 KB
/
TSP_MonteCarlo_report_for_md.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
---
title: "TSP_MonteCarlo"
author: "Austin Nuckols"
date: "5/11/2020"
output: md_document
---
### Introduction
The traveling salesman problem (TSP) is an NP-hard combinatorial optimization problem that can be applied to different fields studying networks. In addition to exact algorithms that calculate the most optimal route through a set of locations, several heuristics have been developed to approximate solutions.
Authors Michael Hahsler and Kurt Hornik developed the TSP R package which provides the infrastructure for solving the traveling salesperson problem using a variety of methods including _Concorde_, one of the best exact TSP solvers. In this report, I apply a Monte Carlo approach to solve a TSP example using the available heuristics for solving asymmetric TSPs. Through this application, I hope 1) to increase the chances of determining an optimal route using a heuristic to solve the problem and 2) to compare different heuristic methods in their performance. In this version, these heuristics will not be compared to an exact method such as _Concorde_, so the performance of this Monte Carlo approach will not be graded against an exact algorithm. In the future, I hope to add this.
```{r Load Packages, include=FALSE}
library(TSP)
library(tidyverse)
library(rgeos)
library(rnaturalearthdata)
library(rnaturalearth)
library(sf)
```
### Identifying the Problem
I will be using the USCA50 dataset included with the TSP R package which contains the distances between 50 US cities as is already coded as an object of class 'TSP'. To note, any distance matrix can be converted to class TSP.
```{r}
data("USCA50")
dat <- USCA50
n_of_cities(dat)
head(labels(dat))
```
The original authors compare the performance of their included heuristics in the original vignette, but given that these methods approximate, the same method will not give the same output each time it is called.
```{r}
solve_TSP(dat, method = "nn")
solve_TSP(dat, method = "nn")
```
This is true even if using the set.seed() function which can return consistent outputs for other R functions that rely on randomization such as rnorm(), sample(), etc.
```{r}
set.seed(123)
solve_TSP(dat, method = "nn")
solve_TSP(dat, method = "nn")
```
Thus, repeated use of these methods using this R package will not yield consistent results. Moreover, it may not yield consistently __good__ results.
### Application of the Monte Carlo method
The Monte Carlo method is a broad class of algorithmic approaches to mathematical problems that rely on random sampling. In science, it can often be used to simulate statistical outcomes of experiments and determine the sample sizes needed to have sufficient experimental power at a given false positive threshold.
Here, I use a Monte Carlo method to simulate the TSP heuristics for a determined number of simulations to select the best performance and its associated route. I imagined a scenario where I needed to find a Hamiltonian cycle (the route in which each city is only visited once) for the 50 cities in the USCA50 dataset. Being from Atlanta, I wanted to start my route from Atlanta, GA and end in Augusta, GA in order to have a relatively short drive back home at the end of my journey.
```{r}
dat <- USCA50
m <- as.matrix(dat) #take the data from USCA50 and convert to matrix format
#Identify my cities of interest (ie my starting and ending points)
atl <- which(labels(dat) == "Atlanta, GA")
aug <- which(labels(dat) == "Augusta, GA")
atsp <- ATSP(m[-c(atl, aug), -c(atl, aug)]) #Regenerate the TSP object without the entries for the distance between Atlanta, GA and Augusta, GA
atsp <- insert_dummy(atsp, label = "ATL/AUG") #I insert a 'dummy' city in leiu of Atlanta and Augusta
atl_aug <- which(labels(atsp) == "ATL/AUG") #Apply the 'dummy' name to this object name atl_aug which will serve as 'dummy' object
atsp[atl_aug, ] <- c(m[-c(atl, aug), atl], 0) #Assign the values of the original dat matrix for Atlanta to the position in the new TSP object followed by 0
atsp[, atl_aug] <- c(m[aug, -c(atl, aug)], 0) #Assign the values of the original dat matrix for Augusta to the position in the new TSP object followed by 0
tail(as.matrix(atsp), 2)
```
This 'dummy' will be excised using the cut_tour() function in TSP to place Atlanta and Augusta as my start and end points.
```{r}
nsims = 200
methods <- c("nearest_insertion", "farthest_insertion", "cheapest_insertion", "arbitrary_insertion", "nn", "repetitive_nn", "two_opt")
start = "Atlanta, GA"
finish = "Augusta, GA"
dummy = atl_aug #Use name of dummy object that was assigned above
mc.tsp <- function(tsp, nsims, methods, begin, finish, dummy) {
l <- list()
tlength <- data.frame()
p <- c()
pname <- c()
path <- list()
path2 <- data.frame()
for (i in methods) { #For each method
for (j in c(1:nsims)) { #Simulate up to the number of sims
s <- solve_TSP(tsp, method = i)
tlength[j,i] <- attributes(s)[3] #Pulls tour length
path[[i]][j] = list(c(start, #Pulls the associated path
labels(cut_tour(s, dummy)),
finish))
}
}
best_length = tlength %>% rownames_to_column(var = "iteration") %>% #Filters for the minimum tour length (ie the best performance)
pivot_longer(cols = -iteration, names_to = "method", values_to = "value") %>%
group_by(method) %>% filter(value == min(value)) %>%
group_by(method) %>% distinct(value, .keep_all = T) %>%
mutate(id = str_c(method, iteration, sep = ""))
best <- list()
best[[1]] <- best_length
best[[2]] <- as.data.frame(unlist(path, F, T)) %>% #Pulls the tour path associated with the iteration containing the best performance
pivot_longer(everything(), names_to = "id", values_to = "value") %>%
filter(id %in% best_length$id) %>% arrange(id)
best[[3]] <- tlength %>% rownames_to_column(var = "iteration") %>% #Collects all of the simulation data in a tidy long format
pivot_longer(cols = -iteration, names_to = "method", values_to = "value")
names(best) <- c("Best Performance Distances", "Most optimal routes generated", "Full Distance Data")
return(best)
}
tst <- mc.tsp(atsp, nsims, methods, start, finish, dummy)
#What have we got?
names(tst)
tst$`Best Performance Distances` %>% arrange(value)
head(tst$`Most optimal routes generated`)
head(tst$`Full Distance Data`)
```
### Analyze the Outcome
Now, let's take a look at the route on a map. The dataset USCA312_GPS has the latitude and longitude coordinates of 312 cities on the North American continent, including the 50 in the analysis.
```{r message=FALSE, warning=FALSE}
data("USCA312_GPS")
loc <- USCA312_GPS
loc50 <- loc %>% filter(name %in% tst$`Most optimal routes generated`$value)
#apply(loc50[,1:2], 2, range) #for the mapping parameters
index <- function(vec) { #Function to return index ids that repeat like 1, 1, 2, 2, 3, 3, ..., n-1, n-1, n for number of nodes (n).
v <- c()
v[1] <- 1
v[2] <- 1
for (i in c(vec)) {
v[i+i+1] <- 1+i
v[i+i+2] <- 1+i
}
v <- v[-c(length(v)-1, length(v))]
return(v)
}
best <- tst$`Best Performance Distances` %>% ungroup() %>% filter(value <= min(value)) %>% select(id) %>% pluck(., 1)
route <- tst$`Most optimal routes generated` %>%
filter(id == best) %>% #input our best performing method and iteration from the output above
mutate(n = c(1:nrow(.))) %>% rbind(., .) %>% arrange(n) %>%
slice(-1, -n()) %>% #Remove the second instance of Atlanta and Augusta
mutate(index = index(1:(nrow(.)/2))) %>%
select(name = value, index) %>%
full_join(loc50)
head(route)
world <- ne_countries(scale = "medium", returnclass = "sf")
ggplot(data = world)+
geom_sf()+
xlab("Longitude")+ylab("Latitude")+
coord_sf(xlim = c(-160, -60), ylim = c(25, 85), expand = F)+
geom_point(data = loc50, aes(x = long, y = lat), color = "red", alpha = 0.5)+
geom_point(data = subset(route, subset = name == "Atlanta, GA" | name == "Augusta, GA"),
aes(x = long, y = lat, color = name))+
geom_line(data = route, aes(x = long, y = lat, group = index))+
scale_color_manual(values = c("Green", "Yellow"))+
theme(legend.position = "none")+
ggtitle("Optimal Route through 50 cities using Arbitrary Insertion
Algorithm")+
labs(caption = "Starting in Atlanta, GA (Green) and finishing in Augusta, GA (Yellow)")
```
What if we compare the routes of different algorithms?
```{r message=FALSE, warning=FALSE}
worst <- tst$`Best Performance Distances` %>% ungroup() %>% filter(value == max(value)) %>% select(id) %>% pluck(., 1)
route2 <- tst$`Most optimal routes generated` %>%
filter(id == worst) %>% #input our best performing method and iteration from the output above
mutate(n = c(1:nrow(.))) %>% rbind(., .) %>% arrange(n) %>%
slice(-1, -n()) %>% #Remove the second instance of Atlanta and Augusta
mutate(index = index(1:(nrow(.)/2))) %>%
select(name = value, index) %>%
full_join(loc50)
ggplot(data = world)+
geom_sf()+
xlab("Longitude")+ylab("Latitude")+
coord_sf(xlim = c(-160, -60), ylim = c(25, 85), expand = F)+
geom_point(data = loc50, aes(x = long, y = lat), alpha = 0.5)+
geom_point(data = subset(route, subset = name == "Atlanta, GA" | name == "Augusta, GA"),
aes(x = long, y = lat, color = name))+
geom_line(data = route, aes(x = long, y = lat, group = index), color = "Red")+
geom_line(data = route2, aes(x = long, y = lat, group = index), color = "Blue", alpha = 0.7)+
scale_color_manual(values = c("Green", "Yellow"))+
theme(legend.position = "none")+
ggtitle("Comparison of Arbitrary Insertion Heuristic (best; red) vs.
Nearest Insertion Heursitic (worst; blue)")+
labs(caption = "Starting in Atlanta, GA (Green) and finishing in Augusta, GA (Yellow)")
```
Interestingly, the latter half of each of these routes is very similar, focusing on the NE U.S. states after coming down from Nunavut, Canada, suggesting that the early decisions constituted the difference in distances.
When simulating this example, the Arbitrary Insertion heuristic will outperform the others consistently in finding the shortest route. However, a more complete comparison between these methods is needed. Using the "Full Distance Data" object from the mc.tsp() function, I can assess the variance in the performances of each method and more accurately compare them.
```{r}
perf <- tst$`Full Distance Data`
ggplot(perf, aes(method, value))+
geom_point(position = position_jitter(width = 0.1))+
stat_summary(geom = "crossbar",
fun.data = "mean_sdl",
fun.args = list(mult=1))+
theme_classic()+
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))+
xlab("Method") + ylab("Cumulative Distance Values")+
ggtitle("Comparison of TSP heuristics over multiple simulations")+
labs(caption = "Crossbars represent mean ± standard deviation")
```
From this plot, we can assess the variance and any skewedness of the methods. There are some interesting findings. For example, while arbitrary insertion will consistently perform the best, it has a relatively high variance compared to farthest insertion. Thus, depending on the number of simulations one is able to run, farthest insertion may provide better results. The two opt method also has a high variance (higher than arbitrary insertion) but is able to yield some of the lowest distances. Cheapest insertion is not normally distributed showing a strong cluster of outcomes right around the mean and a few that drop well below two standard deviations.
Interestingly, nearest neighbors has what appears to be a bimodal distribution of outcomes with a lack of values clustered around its mean. Moreover, its best performance aligns with the performance of repetitive nearest neighbors suggesting that it cannot outperform the repetitive nearest neighbors approach but only match it. Interestingly, repetitive nearest neighbors yields the same value each time it is used.
```{r}
ggplot(perf, aes(as.integer(iteration), value, group = method, color = method))+
geom_point()+
geom_line()+
facet_wrap(~method)+
theme_bw()+
ggtitle("There is no stepwise improvement of heuristic performance over the iterations")+
xlab("Iteration") + ylab("Cumulative Distance Values")
```
These heuristics do not appear to improve their performance over successive iterations. Logically, this makes sense as there is no reward function that was implemented. Instead, each heuristic was just run repeatedly and the unbiased best performance can be selected from the outcomes.
### Conclusion
TSP heuristics can be computationally cheap and fast methods to solve TSP problems with a low to moderate number of locations. Here, I apply a Monte Carlo approach to iterate TSP heuristics and select the best performance from among them. This method can be used to compare the best performances across several TSP heuristics and can generate an optimal route.
This method can be used with any number of cities and with any starting/ending points. While I did not provide an example, any locations for which a distance matrix can be generated and geographic coordinates can be determined are suitable.
One limitation is these data use straight line distances between cities and does not account for actual road distance or other factors such as incline, etc. To note, this method could be applied to more sophisticated distance data such as that from the ggmap package. This would allow for the generation of distance matrices based on actual road distance or on other factors such as average travel time. Another limitation is the lack of the _Concorde_ method in this report. Future versions of this report will compare these heuristics to the _Concorde_ method as a way to determine the absolute performance of these heuristics compared to an exact algorithm.
### References
https://en.wikipedia.org/wiki/Travelling_salesman_problem#As_a_graph_problem
https://en.wikipedia.org/wiki/Monte_Carlo_method
Hahsler M, Hornik K (2007). “TSP - Infrastructure for the traveling salesperson problem.” _Journal of Statistical Software_,
*23*(2), 1-21. ISSN 1548-7660, doi: 10.18637/jss.v023.i02 (URL: https://doi.org/10.18637/jss.v023.i02), <URL:
http://www.jstatsoft.org/v23/i02/>.
https://github.com/dkahle/ggmap