TSP_MonteCarlo_report_for_md.Rmd

---
title: "TSP_MonteCarlo"
author: "Austin Nuckols"
date: "5/11/2020"
output: md_document
---

### Introduction

The traveling salesman problem (TSP) is an NP-hard combinatorial optimization problem that can be applied to different fields studying networks. In addition to exact algorithms that calculate the most optimal route through a set of locations, several heuristics have been developed to approximate solutions. 

Authors Michael Hahsler and Kurt Hornik developed the TSP R package which provides the infrastructure for solving the traveling salesperson problem using a variety of methods including _Concorde_, one of the best exact TSP solvers. In this report, I apply a Monte Carlo approach to solve a TSP example using the available heuristics for solving asymmetric TSPs. Through this application, I hope 1) to increase the chances of determining an optimal route using a heuristic to solve the problem and 2) to compare different heuristic methods in their performance. In this version, these heuristics will not be compared to an exact method such as _Concorde_, so the performance of this Monte Carlo approach will not be graded against an exact algorithm. In the future, I hope to add this. 

```{r Load Packages, include=FALSE}
library(TSP)
library(tidyverse)
library(rgeos)
library(rnaturalearthdata)
library(rnaturalearth)
library(sf)
```

### Identifying the Problem

I will be using the USCA50 dataset included with the TSP R package which contains the distances between 50 US cities as is already coded as an object of class 'TSP'. To note, any distance matrix can be converted to class TSP.

```{r}
data("USCA50")
dat <- USCA50
n_of_cities(dat)
head(labels(dat))
```

The original authors compare the performance of their included heuristics in the original vignette, but given that these methods approximate, the same method will not give the same output each time it is called.

```{r}
solve_TSP(dat, method = "nn")
solve_TSP(dat, method = "nn")
```

This is true even if using the set.seed() function which can return consistent outputs for other R functions that rely on randomization such as rnorm(), sample(), etc.

```{r}
set.seed(123)
solve_TSP(dat, method = "nn")
solve_TSP(dat, method = "nn")
```

Thus, repeated use of these methods using this R package will not yield consistent results. Moreover, it may not yield consistently __good__ results.

### Application of the Monte Carlo method

The Monte Carlo method is a broad class of algorithmic approaches to mathematical problems that rely on random sampling. In science, it can often be used to simulate statistical outcomes of experiments and determine the sample sizes needed to have sufficient experimental power at a given false positive threshold.

Here, I use a Monte Carlo method to simulate the TSP heuristics for a determined number of simulations to select the best performance and its associated route. I imagined a scenario where I needed to find a Hamiltonian cycle (the route in which each city is only visited once) for the 50 cities in the USCA50 dataset. Being from Atlanta, I wanted to start my route from Atlanta, GA and end in Augusta, GA in order to have a relatively short drive back home at the end of my journey.

```{r}
dat <- USCA50
m <- as.matrix(dat) #take the data from USCA50 and convert to matrix format

#Identify my cities of interest (ie my starting and ending points)
atl <- which(labels(dat) == "Atlanta, GA")
aug <- which(labels(dat) == "Augusta, GA")

atsp <- ATSP(m[-c(atl, aug), -c(atl, aug)]) #Regenerate the TSP object without the entries for the distance between Atlanta, GA and Augusta, GA
atsp <- insert_dummy(atsp, label = "ATL/AUG") #I insert a 'dummy' city in leiu of Atlanta and Augusta
atl_aug <- which(labels(atsp) == "ATL/AUG") #Apply the 'dummy' name to this object name atl_aug which will serve as 'dummy' object

atsp[atl_aug, ] <- c(m[-c(atl, aug), atl], 0) #Assign the values of the original dat matrix for Atlanta to the position in the new TSP object followed by 0
atsp[, atl_aug] <- c(m[aug, -c(atl, aug)], 0) #Assign the values of the original dat matrix for Augusta to the position in the new TSP object followed by 0

tail(as.matrix(atsp), 2)
```

This 'dummy' will be excised using the cut_tour() function in TSP to place Atlanta and Augusta as my start and end points.

```{r}
nsims = 200
methods <- c("nearest_insertion", "farthest_insertion", "cheapest_insertion", "arbitrary_insertion", "nn", "repetitive_nn", "two_opt")
start = "Atlanta, GA"
finish = "Augusta, GA"
dummy = atl_aug #Use name of dummy object that was assigned above

mc.tsp <- function(tsp, nsims, methods, begin, finish, dummy) {
  
  
  l <- list()
  tlength <- data.frame()
  p <- c()
  pname <- c()
  path <- list()
  path2 <- data.frame()
  
  for (i in methods) { #For each method

      for (j in c(1:nsims)) { #Simulate up to the number of sims
    
        s <- solve_TSP(tsp, method = i)
        tlength[j,i] <- attributes(s)[3] #Pulls tour length
        
        path[[i]][j] = list(c(start, #Pulls the associated path
              labels(cut_tour(s, dummy)),
              finish))
        
      }
    }
    
    best_length = tlength %>% rownames_to_column(var = "iteration") %>% #Filters for the minimum tour length (ie the best performance)
      pivot_longer(cols = -iteration, names_to = "method", values_to = "value") %>% 
      group_by(method) %>% filter(value == min(value)) %>%
      group_by(method) %>% distinct(value, .keep_all = T) %>% 
      mutate(id = str_c(method, iteration, sep = ""))
    
    best <- list()
    best[[1]] <- best_length
    best[[2]] <- as.data.frame(unlist(path, F, T)) %>% #Pulls the tour path associated with the iteration containing the best performance
      pivot_longer(everything(), names_to = "id", values_to = "value") %>%   
      filter(id %in% best_length$id) %>% arrange(id)
    best[[3]] <- tlength %>% rownames_to_column(var = "iteration") %>% #Collects all of the simulation data in a tidy long format
      pivot_longer(cols = -iteration, names_to = "method", values_to = "value")
    
    names(best) <- c("Best Performance Distances", "Most optimal routes generated", "Full Distance Data")
    
    return(best)
}

tst <- mc.tsp(atsp, nsims, methods, start, finish, dummy)

#What have we got?

names(tst)
tst$`Best Performance Distances` %>% arrange(value)
head(tst$`Most optimal routes generated`)
head(tst$`Full Distance Data`)
```

### Analyze the Outcome

Now, let's take a look at the route on a map. The dataset USCA312_GPS has the latitude and longitude coordinates of 312 cities on the North American continent, including the 50 in the analysis.

```{r message=FALSE, warning=FALSE}
data("USCA312_GPS")
loc <- USCA312_GPS
loc50 <- loc %>% filter(name %in% tst$`Most optimal routes generated`$value)

#apply(loc50[,1:2], 2, range) #for the mapping parameters

index <- function(vec) { #Function to return index ids that repeat like 1, 1, 2, 2, 3, 3, ..., n-1, n-1, n for number of nodes (n).
  v <- c()
  v[1] <- 1
  v[2] <- 1
  for (i in c(vec)) {
    v[i+i+1] <- 1+i
    v[i+i+2] <- 1+i
  }
  v <- v[-c(length(v)-1, length(v))]
  return(v)
}

best <- tst$`Best Performance Distances` %>% ungroup() %>% filter(value <= min(value)) %>% select(id) %>% pluck(., 1)

route <- tst$`Most optimal routes generated` %>% 
  filter(id == best) %>% #input our best performing method and iteration from the output above
  mutate(n = c(1:nrow(.))) %>%  rbind(., .) %>% arrange(n) %>% 
  slice(-1, -n()) %>% #Remove the second instance of Atlanta and Augusta
  mutate(index = index(1:(nrow(.)/2))) %>% 
  select(name = value, index) %>% 
  full_join(loc50)

head(route)

world <- ne_countries(scale = "medium", returnclass = "sf")
ggplot(data = world)+
  geom_sf()+
  xlab("Longitude")+ylab("Latitude")+
  coord_sf(xlim = c(-160, -60), ylim = c(25, 85), expand = F)+
  geom_point(data = loc50, aes(x = long, y = lat), color = "red", alpha = 0.5)+
  geom_point(data = subset(route, subset = name == "Atlanta, GA" | name == "Augusta, GA"), 
             aes(x = long, y = lat, color = name))+
  geom_line(data = route, aes(x = long, y = lat, group = index))+
  scale_color_manual(values = c("Green", "Yellow"))+
  theme(legend.position = "none")+
  ggtitle("Optimal Route through 50 cities using Arbitrary Insertion 
Algorithm")+
  labs(caption = "Starting in Atlanta, GA (Green) and finishing in Augusta, GA (Yellow)")
```

What if we compare the routes of different algorithms?

```{r message=FALSE, warning=FALSE}
worst <- tst$`Best Performance Distances` %>% ungroup() %>% filter(value == max(value)) %>% select(id) %>% pluck(., 1)

route2 <- tst$`Most optimal routes generated` %>% 
  filter(id == worst) %>% #input our best performing method and iteration from the output above
  mutate(n = c(1:nrow(.))) %>%  rbind(., .) %>% arrange(n) %>% 
  slice(-1, -n()) %>% #Remove the second instance of Atlanta and Augusta
  mutate(index = index(1:(nrow(.)/2))) %>% 
  select(name = value, index) %>% 
  full_join(loc50)

ggplot(data = world)+
  geom_sf()+
  xlab("Longitude")+ylab("Latitude")+
  coord_sf(xlim = c(-160, -60), ylim = c(25, 85), expand = F)+
  geom_point(data = loc50, aes(x = long, y = lat), alpha = 0.5)+
  geom_point(data = subset(route, subset = name == "Atlanta, GA" | name == "Augusta, GA"), 
             aes(x = long, y = lat, color = name))+
  geom_line(data = route, aes(x = long, y = lat, group = index), color = "Red")+
  geom_line(data = route2, aes(x = long, y = lat, group = index), color = "Blue", alpha = 0.7)+
  scale_color_manual(values = c("Green", "Yellow"))+
  theme(legend.position = "none")+
  ggtitle("Comparison of Arbitrary Insertion Heuristic (best; red) vs.
Nearest Insertion Heursitic (worst; blue)")+
  labs(caption = "Starting in Atlanta, GA (Green) and finishing in Augusta, GA (Yellow)")
```

Interestingly, the latter half of each of these routes is very similar, focusing on the NE U.S. states after coming down from Nunavut, Canada, suggesting that the early decisions constituted the difference in distances. 

When simulating this example, the Arbitrary Insertion heuristic will outperform the others consistently in finding the shortest route. However, a more complete comparison between these methods is needed. Using the "Full Distance Data" object from the mc.tsp() function, I can assess the variance in the performances of each method and more accurately compare them.

```{r}
perf <- tst$`Full Distance Data`

ggplot(perf, aes(method, value))+
  geom_point(position = position_jitter(width = 0.1))+
  stat_summary(geom = "crossbar",
               fun.data = "mean_sdl",
               fun.args = list(mult=1))+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))+
  xlab("Method") + ylab("Cumulative Distance Values")+
  ggtitle("Comparison of TSP heuristics over multiple simulations")+
  labs(caption = "Crossbars represent mean ± standard deviation")
```

From this plot, we can assess the variance and any skewedness of the methods. There are some interesting findings. For example, while arbitrary insertion will consistently perform the best, it has a relatively high variance compared to farthest insertion. Thus, depending on the number of simulations one is able to run, farthest insertion may provide better results. The two opt method also has a high variance (higher than arbitrary insertion) but is able to yield some of the lowest distances. Cheapest insertion is not normally distributed showing a strong cluster of outcomes right around the mean and a few that drop well below two standard deviations. 

Interestingly, nearest neighbors has what appears to be a bimodal distribution of outcomes with a lack of values clustered around its mean. Moreover, its best performance aligns with the performance of repetitive nearest neighbors suggesting that it cannot outperform the repetitive nearest neighbors approach but only match it. Interestingly, repetitive nearest neighbors yields the same value each time it is used.

```{r}
ggplot(perf, aes(as.integer(iteration), value, group = method, color = method))+
  geom_point()+
  geom_line()+
  facet_wrap(~method)+
  theme_bw()+
  ggtitle("There is no stepwise improvement of heuristic performance over the iterations")+
  xlab("Iteration") + ylab("Cumulative Distance Values")
```

These heuristics do not appear to improve their performance over successive iterations. Logically, this makes sense as there is no reward function that was implemented. Instead, each heuristic was just run repeatedly and the unbiased best performance can be selected from the outcomes.

### Conclusion

TSP heuristics can be computationally cheap and fast methods to solve TSP problems with a low to moderate number of locations. Here, I apply a Monte Carlo approach to iterate TSP heuristics and select the best performance from among them. This method can be used to compare the best performances across several TSP heuristics and can generate an optimal route.

This method can be used with any number of cities and with any starting/ending points. While I did not provide an example, any locations for which a distance matrix can be generated and geographic coordinates can be determined are suitable.

One limitation is these data use straight line distances between cities and does not account for actual road distance or other factors such as incline, etc. To note, this method could be applied to more sophisticated distance data such as that from the ggmap package. This would allow for the generation of distance matrices based on actual road distance or on other factors such as average travel time. Another limitation is the lack of the _Concorde_ method in this report. Future versions of this report will compare these heuristics to the _Concorde_ method as a way to determine the absolute performance of these heuristics compared to an exact algorithm.

### References

https://en.wikipedia.org/wiki/Travelling_salesman_problem#As_a_graph_problem

https://en.wikipedia.org/wiki/Monte_Carlo_method

Hahsler M, Hornik K (2007). “TSP - Infrastructure for the traveling salesperson problem.” _Journal of Statistical Software_,
*23*(2), 1-21. ISSN 1548-7660, doi: 10.18637/jss.v023.i02 (URL: https://doi.org/10.18637/jss.v023.i02), <URL:
http://www.jstatsoft.org/v23/i02/>.

https://github.com/dkahle/ggmap