Transformer Architecture and Types
Focusing on Encoder-Only, Decoder-Only, and Encoder-Decoder Architectures
By Swadesh Swain
Date: June 26, 2024
Introduction
Overview of Transformer Architectures:
- Encoder-Only: Models using only the encoder stack.
- Decoder-Only: Models using only the decoder stack.
- Encoder-Decoder: Models using both encoder and decoder stacks.
Encoder-Only Architecture
Examples: BERT, RoBERTa, ALBERT
Structure:
- Consists of multiple layers of the encoder.
- Each layer includes self-attention and feed-forward networks.
Attention Mechanisms:
Self-Attention:
- Purpose: Allows the model to weigh the importance of different words in a sequence relative to each other.
- How it Works:
- Each token in the input sequence is projected into three vectors: Query (Q), Key (K), and Value (V).
- Attention scores are computed as the dot product of each Query with all Keys, scaled by the square root of the key dimension; a softmax then turns the scores into attention weights.
- The final output for each token is a weighted sum of the Values based on these attention weights.
- Use Cases: Used in both encoder and decoder stacks in transformer models.
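The Query/Key/Value steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the projection matrices `w_q`, `w_k`, `w_v`, the toy dimensions, and the random inputs are all assumptions made for the example, and the division by the square root of the key dimension follows the scaled dot-product form used in the original Transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (seq_len, d_model)."""
    q = x @ w_q  # queries, shape (seq_len, d_k)
    k = x @ w_k  # keys,    shape (seq_len, d_k)
    v = x @ w_v  # values,  shape (seq_len, d_v)
    d_k = q.shape[-1]
    # Dot product of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(d_k)          # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    # Output is the attention-weighted sum of the values.
    return weights @ v, weights

# Toy inputs (hypothetical sizes, random values).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)   # (4, 8)
print(weights.shape)  # (4, 4)
```

Row `i` of `weights` shows how much token `i` attends to every token in the sequence, including itself.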
Positional Encoding: Adds positional information to the input embeddings, because self-attention on its own is insensitive to token order.
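The slides do not say which positional scheme is meant. As one possible illustration, the sketch below assumes the fixed sinusoidal encoding from the original Transformer; BERT-style encoder-only models typically learn their position embeddings instead. The function name and toy sizes are hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sine/cosine position table of shape (seq_len, d_model).

    Assumes d_model is even so sine and cosine columns pair up.
    """
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even columns
    pe[:, 1::2] = np.cos(angles)  # odd columns
    return pe

# Add the position table to (random, stand-in) token embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))
pe = sinusoidal_positional_encoding(6, 8)
inputs = embeddings + pe
print(inputs.shape)  # (6, 8)
```

Because the table depends only on position, two identical tokens at different positions end up with different input vectors.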
Strengths and Use Cases:
- Text Classification: Understanding and categorizing input text.
- Named Entity Recognition (NER): Identifying and classifying entities in text.
- Question Answering (QA): Extracting answers from text based on a query.
- Masked Language Modeling (MLM): Predicting masked tokens to capture bidirectional context.
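To make the MLM objective concrete, here is a simplified sketch of how training inputs might be prepared. It only replaces randomly selected tokens with a `[MASK]` placeholder; BERT's actual recipe is more elaborate (selected tokens are sometimes kept or swapped for a random token). The function name, the masking probability used in the demo, and the example sentence are assumptions for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a masked-language-modeling step."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must recover the original token
        else:
            masked.append(tok)
            labels.append(None)   # position is ignored by the loss
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

Since the encoder sees the tokens on both sides of each `[MASK]`, predicting the label forces it to use left and right context together.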
Decoder-Only Architecture
Examples: GPT, GPT-2, GPT-3
Structure:
- Consists of multiple layers of the decoder.
- Each layer includes masked self-attention and feed-forward networks.
Attention Mechanisms:
Masked Self-Attention:
- Purpose: Prevents the model from accessing future tokens in the sequence during training, ensuring the autoregressive property.
- How it Works:
- Similar to self-attention, but before the softmax a mask sets the attention scores of future tokens to negative infinity, so their attention weights become zero and they don't contribute to the output.
- Use Cases: Used in the decoder stack for tasks requiring sequential generation (e.g., text generation).
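The masking step can be sketched as follows. This is a minimal single-head example that takes already-projected `q`, `k`, `v` matrices as input (an assumption made to keep the sketch short); the mask is applied to the pre-softmax scores so that the resulting weights on future positions are exactly zero.

```python
import numpy as np

def causal_self_attention(q, k, v):
    """Masked self-attention: position i attends only to positions <= i."""
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    # True above the diagonal marks future positions (column j > row i).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax over each row; exp(-inf) = 0 removes the future positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy inputs (hypothetical sizes, random values).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
output, weights = causal_self_attention(q, k, v)
print(np.round(weights, 2))  # entries above the diagonal are zero
```

The first token can only attend to itself, so the first row of `weights` puts all of its mass on position 0.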
Strengths and Use Cases:
- Text Generation: Creating coherent text based on a prompt.
- Language Modeling: Predicting the next word in a sequence.
- Autoregressive Tasks: Generating sequences one token at a time.
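Generating one token at a time amounts to a simple loop: predict the next token from everything produced so far, append it, and repeat. The sketch below uses greedy selection and a hypothetical bigram lookup table as a stand-in for a real decoder-only model; `generate`, `toy_next_token`, and the `<eos>` marker are illustrative names, not part of any library.

```python
def generate(next_token_fn, prompt, max_new_tokens, eos=None):
    """Autoregressive loop: feed the growing sequence back into the model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)  # model sees all tokens generated so far
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens

# Stand-in "model": the next token depends only on the last token.
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "<eos>"}

def toy_next_token(tokens):
    return BIGRAMS.get(tokens[-1], "<eos>")

result = generate(toy_next_token, ["the"], max_new_tokens=5, eos="<eos>")
print(result)  # ['the', 'cat', 'sat', '<eos>']
```

A real model would return a probability distribution over the vocabulary at each step, and the loop could sample from it instead of always taking a single fixed choice.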
Encoder-Decoder Architecture
Examples: Original Transformer, T5, BART
Structure:
- Includes both an encoder and a decoder stack.
- The encoder processes the input sequence, and the decoder generates the output sequence.
Attention Mechanisms:
Self-Attention (Encoder)
Masked Self-Attention (Decoder)
Cross-Attention (Encoder-Decoder Attention):
- Purpose: Allows the decoder to focus on relevant parts of the encoded input sequence when generating each token of the output.
- How it Works:
- The decoder generates Query vectors, and the encoder provides Key and Value vectors.
- The attention mechanism works similarly to self-attention but between the decoder's queries and the encoder's keys and values.
- Use Cases: Essential for tasks requiring understanding of the input sequence to generate related output (e.g., translation, summarization).
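Cross-attention can be sketched the same way as self-attention, with the one difference that queries and keys/values come from different sequences. In this minimal single-head example, the projection matrices, toy dimensions, and random stand-in states are assumptions for illustration only.

```python
import numpy as np

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    """Queries come from the decoder; keys and values from the encoder."""
    q = decoder_states @ w_q   # (tgt_len, d_k)
    k = encoder_states @ w_k   # (src_len, d_k)
    v = encoder_states @ w_v   # (src_len, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (tgt_len, src_len)
    # Row-wise softmax over the source positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy setup: 5 source tokens, 3 target tokens generated so far.
rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 5, 3, 8
encoder_states = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(context.shape)  # (3, 8)
print(weights.shape)  # (3, 5)
```

Each row of `weights` is a distribution over the source tokens, which is why the attention matrix is rectangular here rather than square. No causal mask is needed, since the full input sequence is already available to the decoder.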
Strengths and Use Cases:
- Machine Translation: Translating text from one language to another.
- Text Summarization: Condensing long texts into summaries.
- Text Generation with Context: Generating responses or content based on the input context.
Applications of Each Architecture
Encoder-Only:
- Sentiment Analysis: Determining the sentiment of text.
- Named Entity Recognition (NER): Identifying entities like names, dates, and locations.
- Question Answering (QA): Answering questions based on a given text.
Decoder-Only:
- Story Writing: Generating creative stories.
- Dialogue Generation: Creating conversational agents.
Encoder-Decoder:
- Translation: Converting text between languages.
- Summarization: Summarizing articles or documents.
- Conversational Agents: Generating context-aware responses in dialogue systems.
Conclusion
Comparison: Decoder-Only vs. Normal (Encoder-Decoder) vs. Encoder-Only Transformers
| Decoder-only Transformers | Normal (Encoder-Decoder) Transformers | Encoder-only Transformers |
|---|---|---|
| A Decoder-Only Transformer has a single unit for both encoding the input and generating the output. | A normal Transformer uses one unit to encode the input, called the Encoder, and a separate unit to generate the output, called the Decoder. | An Encoder-Only Transformer has a single unit for processing and encoding the input, without a separate generation component. |
| A Decoder-Only Transformer uses a single type of attention: Masked Self-Attention. | A normal Transformer uses two types of Attention: Self-Attention (within the Encoder and within the Decoder) and Encoder-Decoder Attention (between them). | An Encoder-Only Transformer uses only Self-Attention, allowing each token to attend to all other tokens in the input. |
| A Decoder-Only Transformer uses Masked Self-Attention all the time on everything, both the input and the output. | A normal Transformer uses Masked Self-Attention only in the Decoder, on the output; the Encoder attends to the input without a mask. | An Encoder-Only Transformer uses unmasked Self-Attention throughout, as it doesn't generate sequential outputs. |
| Unidirectional attention (can only look at previous tokens) | Bidirectional attention in the encoder (can look at entire input); unidirectional attention in the decoder | Bidirectional attention (can look at entire input in all layers) |
| Suitable for text generation tasks | Suitable for various tasks including translation and summarization | Suitable for tasks that require understanding of input, such as classification and feature extraction |
| Single-stack design keeps the architecture simple, but output is still generated one token at a time | More versatile but potentially slower, since two stacks and a cross-attention step are involved | Efficient for tasks that don't require text generation, as it processes the whole input in one parallel pass |
Recap:
- Encoder-Only: Best for understanding and classifying text.
- Decoder-Only: Best for generating text.
- Encoder-Decoder: Best for tasks requiring both understanding and generating text.
Future Directions:
- Efficiency Improvements: Research on making transformer models more efficient.
- Interpretability: Efforts to make model decisions more interpretable.
- Domain Adaptation: Enhancing models for specific domains or tasks.
Thank You
Thank you for your attention!
For further questions or discussions: swadeshswain226@gmail.com