Instead of an embedding that represents the absolute position of a word, Transformer-XL uses an embedding that encodes the relative distance between words. This embedding is used to compute the attention score between any two words, which may be separated by n words before or after each other. Transformer architectures have been supported from GPT onwards […]
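To make the idea concrete, here is a minimal sketch of relative-position-aware attention scores. It is not the exact Transformer-XL formulation (which reparameterizes the attention terms with sinusoidal relative encodings and learned biases); the dimensions, variable names, and clipping choice below are assumptions made for illustration.

```python
import numpy as np

# Toy dimensions, chosen only for illustration
seq_len, d_model, max_dist = 6, 8, 4

rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_model))   # query vectors
K = rng.normal(size=(seq_len, d_model))   # key vectors

# One learned embedding per relative offset in [-max_dist, +max_dist],
# replacing absolute-position embeddings
rel_emb = rng.normal(size=(2 * max_dist + 1, d_model))

scores = np.empty((seq_len, seq_len))
for i in range(seq_len):
    for j in range(seq_len):
        # Clip the relative distance so offsets beyond max_dist share an embedding
        offset = int(np.clip(j - i, -max_dist, max_dist)) + max_dist
        # Content term (query . key) plus a relative-position term (query . rel_emb)
        scores[i, j] = Q[i] @ K[j] + Q[i] @ rel_emb[offset]

# Softmax over keys turns the scores into attention weights
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.shape)  # (6, 6)
```

Because the score depends only on the offset j - i rather than on absolute indices, the same embeddings can be reused when attending across segments of arbitrary length, which is what lets the relative scheme generalize to positions never seen during training.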