# TSP From DP to Deep Learning. Episode 3: Pointer Network

This is the third episode of the series TSP From DP to Deep Learning. In this episode, we enter the realm of deep learning: a type of sequence-to-sequence model called the Pointer Network is introduced, tailored to solving problems like TSP or Convex Hull. The full list of episodes in this series is listed below.

## Pointer Networks

In a traditional sequence-to-sequence RNN, the output classes depend on a pre-defined size. For instance, a word-generating RNN utters one word from a vocabulary of size $|V|$ at each time step. However, there is a large class of problems, such as Convex Hull, Delaunay Triangulation, and TSP, where the range of each output is not pre-defined but of variable size, determined by the input. Pointer Networks (Vinyals et al., 2015, https://arxiv.org/abs/1506.03134) overcome this constraint by selecting the $i$-th input with a probability derived from an attention score.

### Convex Hull

In the following example, 10 points are given, and the output is a sequence of points that bounds the set of all points. Each value in the output sequence is an integer, here ranging from 1 to 10, the number of points in this concrete example. In general, finding the exact solution has been proven to be equivalent to sorting, with time complexity $O(n \log n)$.

\begin{align*} &\text{Input: } \mathcal{P} &=& \left\{{P_{1}, \ldots, P_{10}} \right\} \\ &\text{Output: } C^{\mathcal{P}} &=& \{2,4,3,5,6,7,2\} \end{align*}
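For intuition, the exact hull in this index-sequence format can be computed classically, for example with Andrew's monotone chain algorithm. Below is a minimal sketch (the function name `convex_hull_indices` is ours, not from the paper); it returns 1-based indices with the first vertex repeated at the end, matching the output format above.

```python
def convex_hull_indices(points):
    """Andrew's monotone chain; returns the hull as 1-based indices,
    counter-clockwise, with the starting vertex repeated at the end."""
    idx = sorted(range(len(points)), key=lambda i: points[i])

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        ox, oy = points[o]; ax, ay = points[a]; bx, by = points[b]
        return (ax - ox) * (by - oy) - (ay - oy) * (bx - ox)

    lower = []
    for i in idx:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], i) <= 0:
            lower.pop()
        lower.append(i)

    upper = []
    for i in reversed(idx):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], i) <= 0:
            upper.pop()
        upper.append(i)

    hull = lower[:-1] + upper[:-1]
    return [i + 1 for i in hull] + [hull[0] + 1]
```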

### TSP

TSP is almost identical to the Convex Hull problem, except that its output sequence is of fixed length: every point is visited exactly once. In the previous episode, we reduced the complexity from $O(n!)$ to $O(n^2 2^n)$.

\begin{align*} &\text{Input: } \mathcal{P} &= &\left\{{P_{1}, \ldots, P_{6}} \right\} \\ &\text{Output: } C^{\mathcal{P}} &=& \{1,3,2,4,5,6,1\} \end{align*}
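To make the output format concrete, the $O(n!)$ baseline can be sketched as an exhaustive search that produces exactly such a closed, 1-based index tour (the function name `tsp_bruteforce` is ours, for illustration only):

```python
import itertools
import math

def tsp_bruteforce(points):
    """Exhaustive O(n!) search; returns the shortest closed tour
    as 1-based indices, starting and ending at point 1."""
    n = len(points)

    def tour_len(perm):
        tour = (0,) + perm + (0,)
        return sum(math.dist(points[tour[k]], points[tour[k + 1]])
                   for k in range(len(tour) - 1))

    best = min(itertools.permutations(range(1, n)), key=tour_len)
    return [1] + [i + 1 for i in best] + [1]
```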

### Delaunay Triangulation

A Delaunay triangulation for a set of points in a plane is a triangulation such that the circumcircle of every triangle is empty, meaning no point from $\mathcal{P}$ lies in its interior. This kind of problem outputs a sequence of sets, where each element of a set is an index into the input set $\mathcal{P}$.

\begin{align*} &\text{Input: } \mathcal{P} &=& \left\{{P_{1}, \ldots, P_{5}} \right\} \\ &\text{Output: } C^{\mathcal{P}} &=& \{(1,2,4),(1,4,5),(1,3,5),(1,2,3)\} \end{align*}
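The empty-circumcircle condition itself can be checked with the standard in-circle determinant test. A self-contained sketch (the helper name `in_circumcircle` is ours):

```python
def in_circumcircle(a, b, c, d):
    """True if point d lies strictly inside the circumcircle of
    triangle (a, b, c). Uses the classic 3x3 in-circle determinant,
    after orienting (a, b, c) counter-clockwise."""
    # ensure counter-clockwise orientation so the determinant's sign is meaningful
    if (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]) < 0:
        b, c = c, b
    rows = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    det = (rows[0][0] * (rows[1][1] * rows[2][2] - rows[1][2] * rows[2][1])
         - rows[0][1] * (rows[1][0] * rows[2][2] - rows[1][2] * rows[2][0])
         + rows[0][2] * (rows[1][0] * rows[2][1] - rows[1][1] * rows[2][0]))
    return det > 0
```

A triangulation is Delaunay exactly when this test is false for every triangle and every other point.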

### Sequence-to-Sequence Model

Suppose now that $n$ is fixed. Given a training pair $(\mathcal{P}, C^{\mathcal{P}})$, the vanilla sequence-to-sequence model parameterized by $\theta$ computes the conditional probability:

$$p\left(\mathcal{C}^{\mathcal{P}} | \mathcal{P} ; \theta\right)=\prod_{i=1}^{m(\mathcal{P})} p\left(C_{i} | C_{1}, \ldots, C_{i-1}, \mathcal{P} ; \theta\right)$$

The parameters of the model are learnt by maximizing the conditional probabilities for the training set, i.e.

$$\theta^{*}=\underset{\theta}{\arg \max } \sum_{\mathcal{P}, \mathcal{C}^{\mathcal{P}}} \log p\left(\mathcal{C}^{\mathcal{P}} | \mathcal{P} ; \theta\right)$$
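Because the conditional probability factorizes over output steps, the training objective for one pair is just a sum of per-step log-probabilities. A tiny numeric sketch (the per-step probabilities below are made up for illustration):

```python
import math

# Hypothetical per-step model probabilities p(C_i | C_1..C_{i-1}, P)
# for a single training pair; values are illustrative only.
step_probs = [0.7, 0.6, 0.9, 0.8]

# log p(C^P | P; theta) is the sum of the per-step log-probabilities;
# training maximizes this over the whole training set (equivalently,
# minimizes the negative log-likelihood, i.e. cross-entropy).
log_likelihood = sum(math.log(p) for p in step_probs)
```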

### Content Based Input Attention

When attention is applied to the vanilla sequence-to-sequence model, better results are obtained.

Let the encoder and decoder states be $(e_{1}, \ldots, e_{n})$ and $(d_{1}, \ldots, d_{m(\mathcal{P})})$, respectively. At each output time $i$, compute the attention vector $d_{i}'$ (written with a prime to distinguish it from the decoder state $d_i$) as a linear combination of $(e_{1}, \ldots, e_{n})$ with weights $(a_{1}^{i}, \ldots, a_{n}^{i})$: $$d_{i}' = \sum_{j=1}^{n} a_{j}^{i} e_{j}$$

$(a_{1}^{i}, \ldots, a_{n}^{i})$ are the softmax values of $(u_{1}^{i}, \ldots, u_{n}^{i})$, and $u_{j}^{i}$ can be considered a distance between $d_{i}$ and $e_{j}$. Note that $v$, $W_1$, and $W_2$ are learnable parameters of the model.

$$\begin{eqnarray} u_{j}^{i} &=& v^{T} \tanh \left(W_{1} e_{j}+W_{2} d_{{i}}\right) \quad j \in(1, \ldots, n) \\ a_{j}^{i} &=& \operatorname{softmax}\left(u_{j}^{i}\right) \quad j \in(1, \ldots, n) \end{eqnarray}$$
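The two equations above translate almost line-for-line into tensor code. A minimal sketch for a single decoder step, with randomly initialized parameters standing in for the learned $v$, $W_1$, $W_2$ (the variable `d_prime` denotes the resulting attention vector, kept distinct from the decoder state):

```python
import torch

n, hidden = 5, 8
e = torch.randn(n, hidden)    # encoder states e_1..e_n
d_i = torch.randn(hidden)     # decoder state at output step i

# learnable parameters of the model (random here, purely for the sketch)
W1 = torch.randn(hidden, hidden)
W2 = torch.randn(hidden, hidden)
v = torch.randn(hidden)

u = torch.tanh(e @ W1.T + d_i @ W2.T) @ v  # u_j^i for j = 1..n, shape (n,)
a = torch.softmax(u, dim=0)                # attention weights a_j^i
d_prime = a @ e                            # attention vector, shape (hidden,)
```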

### Pointer Networks

The Pointer Network does not blend the encoder states $e_j$ into an attention vector to propagate extra information to the decoder; instead, it uses the scores $u^i_j$ directly as pointers to the input elements.

$$\begin{eqnarray*} u_{j}^{i} &=& v^{T} \tanh \left(W_{1} e_{j}+W_{2} d_{i}\right) \quad j \in(1, \ldots, n) \\ p\left(C_{i} | C_{1}, \ldots, C_{i-1}, \mathcal{P}\right) &=& \operatorname{softmax}\left(u^{i}\right) \end{eqnarray*}$$
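The key difference from content-based attention is that the softmax over $u^i$ *is* the output distribution: it assigns a probability to each of the $n$ input positions, and decoding selects one of them. A small sketch with made-up scores:

```python
import torch

# hypothetical scores u_j^i for n = 6 input positions at one decoder step
u_i = torch.tensor([0.1, 2.3, -0.5, 1.1, 0.0, -1.2])

# the softmax over input positions is the output distribution itself
p = torch.softmax(u_i, dim=0)

# greedy decoding: point at the most probable input element (1-based)
next_input_index = int(torch.argmax(p)) + 1
```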

## More on Attention

In FloydHub Blog — Attention Mechanism (Gabriel, 2019, https://blog.floydhub.com/attention-mechanism/), a clear and detailed discussion is given of the differences and similarities between the two classic types of attention: the first, commonly referred to as Additive Attention, proposed by Dzmitry Bahdanau (Bahdanau et al., 2014, https://arxiv.org/abs/1409.0473), and the second, known as Multiplicative Attention, proposed by Thang Luong (Luong et al., 2015, https://arxiv.org/abs/1508.04025).

It’s well known that Luong Attention defines three forms of the alignment scoring function, i.e. the distance between $d_{i}$ and $e_{j}$:

$$\operatorname{score} \left( d_i, e_j \right)= \begin{cases} d_i^{\top} e_j & \text { dot } \\ d_i^{\top} W_a e_j & \text { general } \\ v_a^{\top} \tanh \left( W_a \left[ d_i ; e_j \right] \right) & \text { concat } \end{cases}$$
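The three scoring variants for a single pair $(d_i, e_j)$ can be sketched directly in PyTorch; the parameters `W_a`, `v_a`, and `W_concat` below are randomly initialized stand-ins for the learned weights (in the concat form, $W_a$ acts on the concatenation $[d_i; e_j]$, hence the doubled input dimension):

```python
import torch

dim = 4
d_i = torch.randn(dim)  # decoder state
e_j = torch.randn(dim)  # encoder state

# stand-ins for learned parameters
W_a = torch.randn(dim, dim)
v_a = torch.randn(dim)
W_concat = torch.randn(dim, 2 * dim)

dot_score = d_i @ e_j                                        # dot
general_score = d_i @ (W_a @ e_j)                            # general
concat_score = v_a @ torch.tanh(W_concat @ torch.cat([d_i, e_j]))  # concat
```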

## PyTorch Implementation

In episode 2, we introduced the TSP dataset, where each case is a line of the following form:

```
x0, y0, x1, y1, ... output 1 v1 v2 v3 ... 1
```
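
A parser for this line format can be sketched as follows (the function name `parse_tsp_line` is ours; it tolerates both comma- and whitespace-separated coordinates, since the exact delimiter depends on the dataset file):

```python
import numpy as np

def parse_tsp_line(line: str):
    """Split one dataset line into an (n, 2) array of points
    and the 1-based closed tour that follows the 'output' keyword."""
    coord_part, tour_part = line.split("output")
    coords = [float(t) for t in coord_part.replace(",", " ").split()]
    points = np.array(coords).reshape(-1, 2)              # (x, y) pairs
    tour = np.array([int(t) for t in tour_part.split()])  # e.g. [1, 3, 2, 1]
    return points, tour
```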


### PyTorch Dataset

Each case is converted to a tuple (input, input_len, output_in, output_out, output_len) of np.ndarray values with appropriate padding and encapsulated in an extended PyTorch Dataset.

```python
from typing import List, Tuple

import numpy as np
from torch.utils.data import Dataset


class TSPDataset(Dataset):
    """Each data item is of form (input, input_len, output_in, output_out, output_len)."""
    data: List[Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        input, input_len, output_in, output_out, output_len = self.data[index]
        return input, input_len, output_in, output_out, output_len
```

Seq-to-seq model code in PyTorch typically utilizes the pack_padded_sequence and pad_packed_sequence APIs to reduce computational cost. A detailed explanation is given at https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning#decoder-1.
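A minimal standalone sketch of what packing does, assuming the (seq_len, batch, features) layout used by the encoder below: the packed tensor keeps only the real timesteps, so the RNN never runs over padding.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# batch of 2 sequences with true lengths 3 and 2, padded to length 3;
# layout is (seq_len, batch, features), PyTorch's default for RNNs
src = torch.randn(3, 2, 4)
lengths = torch.tensor([3, 2])  # must be sorted descending by default

packed = pack_padded_sequence(src, lengths)
# packed.data holds only the 3 + 2 = 5 real timesteps; padding is dropped

unpacked, out_lengths = pad_packed_sequence(packed)
# unpacked restores the padded (3, 2, 4) layout, with lengths recovered
```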

```python
from typing import Tuple, Union

import torch.nn as nn
from torch import Tensor
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class RNNEncoder(nn.Module):
    rnn: Union[nn.LSTM, nn.GRU, nn.RNN]

    def __init__(self, rnn_type: str, bidirectional: bool, num_layers: int,
                 input_size: int, hidden_size: int, dropout: float):
        super(RNNEncoder, self).__init__()
        if bidirectional:
            assert hidden_size % 2 == 0
            # each direction contributes half of the final hidden size
            hidden_size = hidden_size // 2
        # rnn_init is a project helper that constructs the requested RNN type
        self.rnn = rnn_init(rnn_type, input_size=input_size,
                            hidden_size=hidden_size,
                            bidirectional=bidirectional,
                            num_layers=num_layers, dropout=dropout)

    def forward(self, src: Tensor, src_lengths: Tensor,
                hidden: Tensor = None) -> Tuple[Tensor, Tensor]:
        lengths = src_lengths.view(-1).tolist()
        # pack so the RNN skips padded positions
        packed_src = pack_padded_sequence(src, lengths)
        memory_bank, hidden_final = self.rnn(packed_src, hidden)
        # restore the padded (seq_len, batch, hidden) layout
        memory_bank = pad_packed_sequence(memory_bank)[0]
        return memory_bank, hidden_final
```

### Attention Code

```python
from typing import Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


class Attention(nn.Module):
    linear_out: nn.Linear

    def __init__(self, dim: int):
        super(Attention, self).__init__()
        self.linear_out = nn.Linear(dim * 2, dim, bias=False)

    def score(self, src: Tensor, target: Tensor) -> Tensor:
        """Luong dot score: batched matmul of target with src transposed."""
        # src: (batch_size, src_len, dim), target: (batch_size, target_len, dim)
        src_ = src.transpose(1, 2)
        return torch.bmm(target, src_)

    def forward(self, src: Tensor, target: Tensor,
                src_lengths: Tensor) -> Tuple[Tensor, Tensor]:
        assert target.dim() == 3
        align_score = self.score(src, target)
        # sequence_mask (defined elsewhere in the repo) marks real positions
        mask = sequence_mask(src_lengths)
        # (batch_size, max_len) -> (batch_size, 1, max_len)
        mask = mask.unsqueeze(1)
        # padded positions get -inf so softmax assigns them zero weight
        align_score.data.masked_fill_(~mask, -float('inf'))
        align_score = F.softmax(align_score, -1)
        # context vector: weighted sum of encoder states
        c = torch.bmm(align_score, src)
        concat_c = torch.cat([c, target], -1)
        attn_h = self.linear_out(concat_c)
        return attn_h, align_score
```
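The `sequence_mask` helper used above is defined elsewhere in the repository; one plausible implementation, for completeness, builds a boolean mask that is True at every real (non-padded) position:

```python
import torch

def sequence_mask(lengths: torch.Tensor, max_len: int = None) -> torch.Tensor:
    """Boolean mask of shape (batch_size, max_len): True where the
    position index is smaller than the sequence's true length.
    (One possible version; the repo's own helper may differ in detail.)"""
    if max_len is None:
        max_len = int(lengths.max())
    positions = torch.arange(max_len, device=lengths.device)
    # broadcast (1, max_len) < (batch_size, 1) -> (batch_size, max_len)
    return positions.unsqueeze(0) < lengths.unsqueeze(1)
```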

The complete PyTorch implementation source code is also available on GitHub.

Gabriel (2019). Attention Mechanism. FloydHub Blog. https://blog.floydhub.com/attention-mechanism/