kshama/million-param.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Created by [Fareed Khan](https://www.linkedin.com/in/fareed-khan-dev/)\n",
    "\n",
    "If you find this notebook helpful, there's no \"like\" option, but you can connect with me on [LinkedIn](https://www.linkedin.com/in/fareed-khan-dev/) or follow me on [Medium](https://medium.com/@fareedkhandev)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2.3M Params LLM From Scratch Python\n",
    "\n",
    "<!-- Logo Image -->\n",
    "<img src=\"https://i.ibb.co/r56NHtM/1-ox3h-To-PFUWx-Aw-URx-YEXi-Gg-removebg-preview.png\" alt=\"Cropped Image\">\n",
    "\n",
    "Making your own Large Language Model (LLM) is a cool thing that many big companies like Google, Twitter, and Facebook are doing. They release different versions of these models, like 7 billion, 13 billion, or 70 billion. Even smaller communities are doing it too. You might have read blogs or watched videos on creating your own LLM, but they usually talk a lot about theory and not so much about the actual steps and code."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of Contents\n",
    "\n",
    "- Prerequisites\n",
    "- Understanding the Transformer Architecture of LLaMA\n",
    "  - Pre-normalization Using RMSNorm\n",
    "  - SwiGLU Activation Function\n",
    "  - Rotary Embeddings (RoPE)\n",
    "- Setting the Stage\n",
    "- Data Preprocessing\n",
    "- Evaluation Strategy\n",
    "- Setting Up a Base Neural Network Model\n",
    "- Replicating LLaMA Architecture\n",
    "  - RMSNorm for pre-normalization\n",
    "  - Rotary Embeddings\n",
    "  - SwiGLU activation function\n",
    "- Experimenting with hyperparameters\n",
    "- Conclusion\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prerequisites\n",
    "\n",
    "Make sure you have a basic understanding of object-oriented programming (**OOP**) and neural networks (**NN**). Familiarity with **PyTorch** will also be helpful in coding.\n",
    "\n",
    "| Topic               | Video Link                                                |\n",
    "|---------------------|-----------------------------------------------------------|\n",
    "| OOP                 | [OOP Video](https://www.youtube.com/watch?v=Ej_02ICOIgs&pp=ygUKb29wIHB5dGhvbg%3D%3D) |\n",
    "| Neural Network      | [Neural Network Video](https://www.youtube.com/watch?v=Jy4wM2X21u0&pp=ygUbbmV1cmFsIG5ldHdvcmsgcHl0aG9uIHRvcmNo) |\n",
    "| Pytorch             | [Pytorch Video](https://www.youtube.com/watch?v=V_xro1bcAuA&pp=ygUbbmV1cmFsIG5ldHdvcmsgcHl0aG9uIHRvcmNo) |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understanding the Transformer Architecture of LLaMA\n",
    "\n",
    "Before diving into creating our own LLM using the LLaMA approach, it’s essential to understand the architecture of LLaMA. Below is a comparison diagram between the vanilla transformer and LLaMA.\n",
    "\n",
    "<img src=\"https://cdn-images-1.medium.com/max/25620/1*nt-ydHhSVsaLXq_HZRaLQA.png\" alt=\"Difference between Transformers and Llama architecture (Llama architecture by Umar Jamil)\" style=\"width: 50%;\">\n",
    "(Llama architecture by Umar Jamil)\n",
    "\n",
    "In case you’re not familiar with the vanilla transformer architecture, you can read [this blog](https://medium.com/@fareedkhandev/understanding-transformers-a-step-by-step-math-example-part-1-a7809015150a) for a basic guide.\n",
    "\n",
    "Let’s look into the essential concepts of LLaMA with a bit more detail:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pre-normalization Using RMSNorm:\n",
    "\n",
    "In the LLaMA approach, a technique called RMSNorm is employed for normalizing the input of each transformer sub-layer. This method is inspired by GPT-3 and is designed to optimize the computational cost associated with Layer Normalization. RMSNorm provides similar performance to LayerNorm but reduces the running time significantly (by 7%∼64%).\n",
    "\n",
    "<img src=\"https://cdn-images-1.medium.com/max/3604/1*9FA6P93WhRuWFXxVlPG3LA.png\" alt=\"Root Mean Square Layer Normalization Paper\" style=\"width: 50%;\">\n",
    "\n",
    "\n",
    "It achieves this by emphasizing re-scaling invariance and regulating the summed inputs based on the root mean square (RMS) statistic. The primary motivation is to simplify LayerNorm by removing the mean statistic. Interested readers can explore the detailed implementation of RMSNorm [here](https://github.com/bzhangGo/rmsnorm/blob/master/rmsnorm_torch.py)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SwiGLU Activation Function:\n",
    "\n",
    "LLaMA introduces the SwiGLU activation function, drawing inspiration from PaLM. To understand SwiGLU, it’s essential to first grasp the Swish activation function. SwiGLU extends Swish and involves a custom layer with a dense network to split and multiply input activations.\n",
    "\n",
    "<img src=\"https://cdn-images-1.medium.com/max/13536/1*N3dwnqNUD0TdwPYO0NlhYg.png\" alt=\"SwiGLU: GLU Variants Improve Transformer\" style=\"width: 50%;\">\n",
    "\n",
    "The aim is to enhance the expressive power of the model by introducing a more sophisticated activation function. Further details on SwiGLU can be found in the associated [paper](https://arxiv.org/pdf/2002.05202v1.pdf)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rotary Embeddings (RoPE):\n",
    "\n",
    "Rotary Embeddings, or RoPE, is a type of position embedding used in LLaMA. It encodes absolute positional information using a rotation matrix and naturally includes explicit relative position dependency in self-attention formulations. RoPE offers advantages such as scalability to various sequence lengths and decaying inter-token dependency with increasing relative distances.\n",
    "\n",
    "This is achieved by encoding relative positions through multiplication with a rotation matrix, resulting in decayed relative distances — a desirable feature for natural language encoding. Those interested in the mathematical details can refer to the [RoPE paper](https://arxiv.org/pdf/2104.09864v4.pdf).\n",
    "\n",
    "In addition to these concepts, the LLaMA paper introduces other significant approaches, including the use of the **AdamW optimizer** with specific parameters, efficient implementations such as the causal [multi-head attention operator](https://facebookresearch.github.io/xformers/components/mha.html) available in the xformers library, and manually implemented backward functions for transformer layers to optimize computation during backward passes.\n",
    "\n",
    "A special acknowledgment and thanks to [Anush Kumar](https://akgeni.medium.com/) for providing an in-depth explanation of each crucial aspect of LLaMA."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting the Stage\n",
    "\n",
    "We’ll be working with a range of Python libraries throughout this project, so let’s import them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# PyTorch for implementing LLM (No GPU)\n",
    "import torch\n",
    "\n",
    "# Neural network modules and functions from PyTorch\n",
    "from torch import nn\n",
    "from torch.nn import functional as F\n",
    "\n",
    "# NumPy for numerical operations\n",
    "import numpy as np\n",
    "\n",
    "# Matplotlib for plotting Loss etc.\n",
    "from matplotlib import pyplot as plt\n",
    "\n",
    "# Time module for tracking execution time\n",
    "import time\n",
    "\n",
    "# Pandas for data manipulation and analysis\n",
    "import pandas as pd\n",
    "\n",
    "# urllib for handling URL requests (Downloading Dataset)\n",
    "import urllib.request"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Furthermore, I’m creating a configuration object that stores model parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configuration object for model parameters\n",
    "MASTER_CONFIG = {\n",
    "    # Adding parameters later\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This approach maintains flexibility, allowing for the addition of more parameters as needed in the future."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Preprocessing\n",
    "\n",
    "In the original LLaMA paper, diverse open-source datasets were employed to train and evaluate the model.\n",
    "\n",
    "<img src=\"https://cdn-images-1.medium.com/max/2304/1*vcZOIbZVutELPXNrtAVSdg.png\" alt=\"LLaMA Open and Efficient Foundation Language Models\" width=\"50%\">\n",
    "\n",
    "\n",
    "Unfortunately, utilizing extensive datasets may be impractical for smaller projects. Therefore, for our implementation, we’ll take a more modest approach by creating a dramatically scaled-down version of LLaMA.\n",
    "\n",
    "Given the constraints of not having access to vast amounts of data, we will focus on training a simplified version of LLaMA using the TinyShakespeare dataset. This open source dataset, available [here](https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt), contains approximately 40,000 lines of text from various Shakespearean works. This choice is influenced by the [Makemore series by Karpathy](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ), which provides valuable insights into training language models.\n",
    "\n",
    "While LLaMA was trained on an extensive dataset comprising **1.4 trillion** tokens, our dataset, TinyShakespeare, containing around **1 million characters**.\n",
    "\n",
    "First, let’s obtain our dataset by downloading it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The URL of the raw text file on GitHub\n",
    "url = \"https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\"\n",
    "\n",
    "# The file name for local storage\n",
    "file_name = \"tinyshakespeare.txt\"\n",
    "\n",
    "# Execute the download\n",
    "urllib.request.urlretrieve(url, file_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This Python script fetches the tinyshakespeare dataset from the specified URL and saves it locally with the filename **“tinyshakespeare.txt.”**\n",
    "\n",
    "Next, let’s determine the vocabulary size, which represents the unique number of characters in our dataset. Here’s the code snippet:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the content of the dataset\n",
    "lines = open(\"tinyshakespeare.txt\", 'r').read()\n",
    "\n",
    "# Create a sorted list of unique characters in the dataset\n",
    "vocab = sorted(list(set(lines)))\n",
    "\n",
    "# Display the first 10 characters in the vocabulary list\n",
    "print('Printing the first 10 characters of the vocab list:', vocab[:10])\n",
    "\n",
    "# Output the total number of characters in our dataset (Vocabulary Size)\n",
    "print('Total number of characters in our dataset (Vocabulary Size):', len(vocab))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/52186/1*No8_KIyXjY7gzvIMf2t9hg.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "\n",
    "Now, we’re creating mappings between integers to characters (**itos**) and characters to integers (**stoi**). Here’s the code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mapping integers to characters (itos)\n",
    "itos = {i: ch for i, ch in enumerate(vocab)}\n",
    "\n",
    "# Mapping characters to integers (stoi)\n",
    "stoi = {ch: i for i, ch in enumerate(vocab)}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/52186/1*F3J_886M-k9pvAjUUOXlxA.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "In the original LLaMA paper, the [SentencePiece byte-pair encoding tokenizer](https://github.com/google/sentencepiece) from Google was used. However, for simplicity, we’ll opt for a basic character-level tokenizer. Let’s create encode and decode functions that we’ll later apply to our dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Encode function: Converts a string to a list of integers using the mapping stoi\n",
    "def encode(s):\n",
    "    return [stoi[ch] for ch in s]\n",
    "\n",
    "# Decode function: Converts a list of integers back to a string using the mapping itos\n",
    "def decode(l):\n",
    "    return ''.join([itos[i] for i in l])\n",
    "\n",
    "# Example: Encode the string \"hello\" and then decode the result\n",
    "decode(encode(\"morning\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The final line will output `morning` confirms the proper functionality of the encode and decode functions.\n",
    "\n",
    "We are now converting our dataset into a torch tensor, specifying its data type for further operations using **PyTorch**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert the dataset into a torch tensor with specified data type (dtype)\n",
    "dataset = torch.tensor(encode(lines), dtype=torch.int8)\n",
    "\n",
    "# adding the vocab size\n",
    "MASTER_CONFIG = {\n",
    "    \"vocab_size\": len(vocab),\n",
    "}\n",
    "\n",
    "# Display the shape of the resulting tensor\n",
    "print(dataset.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output is `torch.Size([1115394])` indicates that our dataset contains approximately **one million tokens**. It's worth noting that this is significantly smaller than the LLaMA dataset, which consists of **1.4 trillion tokens**.\n",
    "\n",
    "We’ll create a function responsible for splitting our dataset into training, validation, or test sets. In machine learning or deep learning projects, such splits are crucial for developing and evaluating models, and the same principle applies here in replicating a Large Language Model (LLM) approach:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to get batches for training, validation, or testing\n",
    "def get_batches(data, split, batch_size, context_window, config=MASTER_CONFIG):\n",
    "    # Split the dataset into training, validation, and test sets\n",
    "    train = data[:int(.8 * len(data))]\n",
    "    val = data[int(.8 * len(data)): int(.9 * len(data))]\n",
    "    test = data[int(.9 * len(data)):]\n",
    "\n",
    "    # Determine which split to use\n",
    "    batch_data = train\n",
    "    if split == 'val':\n",
    "        batch_data = val\n",
    "    if split == 'test':\n",
    "        batch_data = test\n",
    "\n",
    "    # Pick random starting points within the data\n",
    "    ix = torch.randint(0, batch_data.size(0) - context_window - 1, (batch_size,))\n",
    "\n",
    "    # Create input sequences (x) and corresponding target sequences (y)\n",
    "    x = torch.stack([batch_data[i:i+context_window] for i in ix]).long()\n",
    "    y = torch.stack([batch_data[i+1:i+context_window+1] for i in ix]).long()\n",
    "\n",
    "    return x, y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that our splitting function is defined, let’s establish two parameters crucial for this process:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update the MASTER_CONFIG with batch_size and context_window parameters\n",
    "MASTER_CONFIG.update({\n",
    "    'batch_size': 8,          # Number of batches to be processed at each random split\n",
    "    'context_window': 16      # Number of characters in each input (x) and target (y) sequence of each batch\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "batch_size determines how many batches are processed at each random split, while context_window specifies the number of characters in each input (x) and target (y) sequence of each batch.\n",
    "\n",
    "Let’s print a random sample from the train split of batch 8 and context window 16 from our dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Obtain batches for training using the specified batch size and context window\n",
    "xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])\n",
    "\n",
    "# Decode the sequences to obtain the corresponding text representations\n",
    "decoded_samples = [(decode(xs[i].tolist()), decode(ys[i].tolist())) for i in range(len(xs))]\n",
    "\n",
    "# Print the random sample\n",
    "print(decoded_samples)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/14848/1*lv5ckp3X_s2qrVMW1QVkPg.png\" alt=\"\" width=\"50%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation Strategy\n",
    "\n",
    "Now, we are set to create a function dedicated to evaluating our self-created LLaMA architecture. The reason for doing this before defining the actual model approach is to enable continuous evaluation during the training process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@torch.no_grad()  # Don't compute gradients for this function\n",
    "def evaluate_loss(model, config=MASTER_CONFIG):\n",
    "    # Placeholder for the evaluation results\n",
    "    out = {}\n",
    "    \n",
    "    # Set the model to evaluation mode\n",
    "    model.eval()\n",
    "\n",
    "    # Iterate through training and validation splits\n",
    "    for split in [\"train\", \"val\"]:\n",
    "        # Placeholder for individual losses\n",
    "        losses = []\n",
    "\n",
    "        # Generate 10 batches for evaluation\n",
    "        for _ in range(10):\n",
    "            # Get input sequences (xb) and target sequences (yb)\n",
    "            xb, yb = get_batches(dataset, split, config['batch_size'], config['context_window'])\n",
    "            \n",
    "            # Perform model inference and calculate the loss\n",
    "            _, loss = model(xb, yb)\n",
    "            \n",
    "            # Append the loss to the list\n",
    "            losses.append(loss.item())\n",
    "\n",
    "        # Calculate the mean loss for the split and store it in the output dictionary\n",
    "        out[split] = np.mean(losses)\n",
    "    \n",
    "    # Set the model back to training mode\n",
    "    model.train()\n",
    "    \n",
    "    return out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have used the **loss** as a metric to assess the performance of the model during training iterations. Our function iterates through the training and validation splits, computes the mean loss over 10 batches for each split, and finally returns the results. The model is then set back to training mode with model.train().\n",
    "\n",
    "## Setting Up a Base Neural Network Model\n",
    "\n",
    "We’re building a basic neural network that we’ll improve later using LLaMA techniques."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Definition of a basic neural network class\n",
    "class SimpleBrokenModel(nn.Module):\n",
    "    def __init__(self, config=MASTER_CONFIG):\n",
    "        super().__init__()\n",
    "        self.config = config\n",
    "\n",
    "        # Embedding layer to convert character indices to vectors (vocab size: 65)\n",
    "        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])\n",
    "\n",
    "        # Linear layers for modeling relationships between features\n",
    "        # (to be updated with SwiGLU activation function as in LLaMA)\n",
    "        self.linear = nn.Sequential(\n",
    "            nn.Linear(config['d_model'], config['d_model']),\n",
    "            nn.ReLU(),  # Currently using ReLU, will be replaced with SwiGLU as in LLaMA\n",
    "            nn.Linear(config['d_model'], config['vocab_size']),\n",
    "        )\n",
    "\n",
    "        # Print the total number of model parameters\n",
    "        print(\"Model parameters:\", sum([m.numel() for m in self.parameters()]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the current architecture, the embedding layer has a vocabulary size of 65, representing the characters in our dataset. As this serves as our base model, we are using **ReLU **as the activation function in the linear layers; however, this will later be replaced with SwiGLU, as used in LLaMA.\n",
    "\n",
    "To create a forward pass for our base model, we must define a forward function within our NN model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Definition of a basic neural network class\n",
    "class SimpleBrokenModel(nn.Module):\n",
    "    def __init__(self, config=MASTER_CONFIG):\n",
    "        super().__init__()\n",
    "        self.config = config\n",
    "\n",
    "        # Embedding layer to convert character indices to vectors (vocab size: 65)\n",
    "        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])\n",
    "\n",
    "        # Linear layers for modeling relationships between features\n",
    "        # (to be updated with SwiGLU activation function as in LLaMA)\n",
    "        self.linear = nn.Sequential(\n",
    "            nn.Linear(config['d_model'], config['d_model']),\n",
    "            nn.ReLU(),  # Currently using ReLU, will be replaced with SwiGLU as in LLaMA\n",
    "            nn.Linear(config['d_model'], config['vocab_size']),\n",
    "        )\n",
    "\n",
    "        # Print the total number of model parameters\n",
    "        print(\"Model parameters:\", sum([m.numel() for m in self.parameters()]))\n",
    "\n",
    "    # Forward pass function for the base model\n",
    "    def forward(self, idx, targets=None):\n",
    "        # Embedding layer converts character indices to vectors\n",
    "        x = self.embedding(idx)\n",
    "        \n",
    "        # Linear layers for modeling relationships between features\n",
    "        a = self.linear(x)\n",
    "        \n",
    "        # Apply softmax activation to obtain probability distribution\n",
    "        logits = F.softmax(a, dim=-1)\n",
    "\n",
    "        # If targets are provided, calculate and return the cross-entropy loss\n",
    "        if targets is not None:\n",
    "            # Reshape logits and targets for cross-entropy calculation\n",
    "            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))\n",
    "            return logits, loss\n",
    "\n",
    "        # If targets are not provided, return the logits\n",
    "        else:\n",
    "            return logits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This forward pass function takes character indices (idx) as input, applies the embedding layer, passes the result through linear layers, applies a softmax activation to obtain a probability distribution (logits). If targets are provided, it calculates the cross-entropy loss and returns both logits and loss. If targets are not provided, it returns only the logits.\n",
    "\n",
    "To instantiate this model, we can directly invoke the class and print the total number of parameters in our Simple Neural Network Model. We’ve set the dimension of our linear layers to 128, specifying this value in our config object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update MASTER_CONFIG with the dimension of linear layers (128)\n",
    "MASTER_CONFIG.update({\n",
    "    'd_model': 128,\n",
    "})\n",
    "\n",
    "# Instantiate the SimpleBrokenModel using the updated MASTER_CONFIG\n",
    "model = SimpleBrokenModel(MASTER_CONFIG)\n",
    "\n",
    "# Print the total number of parameters in the model\n",
    "print(\"Total number of parameters in the Simple Neural Network Model:\", sum([m.numel() for m in model.parameters()]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/56088/1*8k4YL-ZFXGHFgfu_-PMaPg.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "Our Simple Neural Network Model comprises approximately 33,000 parameters.\n",
    "\n",
    "Similarly, to compute logits and loss, we only need to feed our split dataset into our model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Obtain batches for training using the specified batch size and context window\n",
    "xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])\n",
    "\n",
    "# Calculate logits and loss using the model\n",
    "logits, loss = model(xs, ys)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To train our base model and note its performance, we need to specify some parameters. We are training for a total of 1000 epochs. Increasing the batch size to 32 from 8, and set the log_interval to 10, indicating that the code will print or log information about the training progress every 10 batches. For optimization, we’ll use the Adam optimizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update MASTER_CONFIG with training parameters\n",
    "MASTER_CONFIG.update({\n",
    "    'epochs': 1000,          # Number of training epochs\n",
    "    'log_interval': 10,      # Log information every 10 batches during training\n",
    "    'batch_size': 32,        # Increase batch size to 32\n",
    "})\n",
    "\n",
    "# Instantiate the SimpleBrokenModel with updated configuration\n",
    "model = SimpleBrokenModel(MASTER_CONFIG)\n",
    "\n",
    "# Define the Adam optimizer for model parameters\n",
    "optimizer = torch.optim.Adam(\n",
    "    model.parameters(),      # Pass the model parameters to the optimizer\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s execute the training process and capture the loss from our base model, including the total number of parameters. **Additionally, each line is commented for clarity**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to perform training\n",
    "def train(model, optimizer, scheduler=None, config=MASTER_CONFIG, print_logs=False):\n",
    "    # Placeholder for storing losses\n",
    "    losses = []\n",
    "    \n",
    "    # Start tracking time\n",
    "    start_time = time.time()\n",
    "\n",
    "    # Iterate through epochs\n",
    "    for epoch in range(config['epochs']):\n",
    "        # Zero out gradients\n",
    "        optimizer.zero_grad()\n",
    "\n",
    "        # Obtain batches for training\n",
    "        xs, ys = get_batches(dataset, 'train', config['batch_size'], config['context_window'])\n",
    "\n",
    "        # Forward pass through the model to calculate logits and loss\n",
    "        logits, loss = model(xs, targets=ys)\n",
    "\n",
    "        # Backward pass and optimization step\n",
    "        loss.backward()\n",
    "        optimizer.step()\n",
    "\n",
    "        # If a learning rate scheduler is provided, adjust the learning rate\n",
    "        if scheduler:\n",
    "            scheduler.step()\n",
    "\n",
    "        # Log progress every specified interval\n",
    "        if epoch % config['log_interval'] == 0:\n",
    "            # Calculate batch time\n",
    "            batch_time = time.time() - start_time\n",
    "            \n",
    "            # Evaluate loss on validation set\n",
    "            x = evaluate_loss(model)\n",
    "            \n",
    "            # Store the validation loss\n",
    "            losses += [x]\n",
    "            \n",
    "            # Print progress logs if specified\n",
    "            if print_logs:\n",
    "                print(f\"Epoch {epoch} | val loss {x['val']:.3f} | Time {batch_time:.3f} | ETA in seconds {batch_time * (config['epochs'] - epoch)/config['log_interval'] :.3f}\")\n",
    "                \n",
    "            # Reset the timer\n",
    "            start_time = time.time()\n",
    "\n",
    "            # Print learning rate if a scheduler is provided\n",
    "            if scheduler:\n",
    "                print(\"lr: \", scheduler.get_lr())\n",
    "\n",
    "    # Print the final validation loss\n",
    "    print(\"Validation loss: \", losses[-1]['val'])\n",
    "    \n",
    "    # Plot the training and validation loss curves\n",
    "    return pd.DataFrame(losses).plot()\n",
    "\n",
    "# Execute the training process\n",
    "train(model, optimizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/17880/1*z1YNDzGegE2sg12SfkIZjw.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "\n",
    "The initial cross-entropy loss before training stands at 4.17, and after 1000 epochs, it reduces to 3.93. In this context, cross-entropy reflects the likelihood of selecting the incorrect word.\n",
    "\n",
    "Our model incorporates a softmax layer on the logits, which transforms a vector of numbers into a probability distribution. Let’s use the built-in F.cross_entropy function, we need to directly pass in the [unnormalized logits](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html). Consequently, we will modify our model accordingly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Definition of a basic neural network class\n",
    "class SimpleBrokenModel(nn.Module):\n",
    "    def __init__(self, config=MASTER_CONFIG):\n",
    "        super().__init__()\n",
    "        self.config = config\n",
    "\n",
    "        # Embedding layer to convert character indices to vectors (vocab size: 65)\n",
    "        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])\n",
    "\n",
    "        # Linear layers for modeling relationships between features\n",
    "        # (to be updated with SwiGLU activation function as in LLaMA)\n",
    "        self.linear = nn.Sequential(\n",
    "            nn.Linear(config['d_model'], config['d_model']),\n",
    "            nn.ReLU(),  # Currently using ReLU, will be replaced with SwiGLU as in LLaMA\n",
    "            nn.Linear(config['d_model'], config['vocab_size']),\n",
    "        )\n",
    "\n",
    "        # Print the total number of model parameters\n",
    "        print(\"Model parameters:\", sum([m.numel() for m in self.parameters()]))\n",
    "\n",
    "    # Forward pass function for the base model\n",
    "    def forward(self, idx, targets=None):\n",
    "        \n",
    "        # Embedding layer converts character indices to vectors\n",
    "        x = self.embedding(idx)\n",
    "        \n",
    "        # Linear layers for modeling relationships between features\n",
    "        logits = self.linear(x)\n",
    "\n",
    "        # If targets are provided, calculate and return the cross-entropy loss\n",
    "        if targets is not None:\n",
    "            # Reshape logits and targets for cross-entropy calculation\n",
    "            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))\n",
    "            return logits, loss\n",
    "\n",
    "        # If targets are not provided, return the logits\n",
    "        else:\n",
    "            return logits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s recreate the updated SimpleModel and train it for 1000 epochs to observe any changes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the updated SimpleModel\n",
    "model = SimpleBrokenModel(MASTER_CONFIG)\n",
    "\n",
    "# Obtain batches for training\n",
    "xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])\n",
    "\n",
    "# Calculate logits and loss using the model\n",
    "logits, loss = model(xs, ys)\n",
    "\n",
    "# Define the Adam optimizer for model parameters\n",
    "optimizer = torch.optim.Adam(model.parameters())\n",
    "\n",
    "# Train the model for 100 epochs\n",
    "train(model, optimizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/21656/1*90UeocQghByPqaZ_g6dYEw.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "\n",
    "After reducing the loss to 2.51, let’s explore how our language model with approximately **33,000 parameters** generates text during inferencing. We’ll create a ‘generate’ function, which we’ll later use when replicating LLaMA:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate function for text generation using the trained model\n",
    "def generate(model, config=MASTER_CONFIG, max_new_tokens=30):\n",
    "    idx = torch.zeros(5, 1).long()\n",
    "    for _ in range(max_new_tokens):\n",
    "        # Call the model\n",
    "        logits = model(idx[:, -config['context_window']:])\n",
    "        last_time_step_logits = logits[\n",
    "            :, -1, :\n",
    "        ]  # all the batches (1), last time step, all the logits\n",
    "        p = F.softmax(last_time_step_logits, dim=-1)  # softmax to get probabilities\n",
    "        idx_next = torch.multinomial(\n",
    "            p, num_samples=1\n",
    "        )  # sample from the distribution to get the next token\n",
    "        idx = torch.cat([idx, idx_next], dim=-1)  # append to the sequence\n",
    "    return [decode(x) for x in idx.tolist()]\n",
    "\n",
    "# Generate text using the trained model\n",
    "generate(model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/43492/1*D1X9nqb8gPN5tXZJL2eAiQ.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "\n",
    "The generated text doesn’t look great with our basic model of around 33K parameters. However, now that we’ve laid the groundwork with this simple model, we’ll move on to constructing the LLaMA architecture in the next section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Replicating LLaMA Architecture\n",
    "\n",
    "In the earlier part of the blog, we covered essential concepts, and now, we’ll integrate these concepts into our base model. LLaMA introduces three architectural modifications to the original Transformer:\n",
    "\n",
    " 1. RMSNorm for pre-normalization\n",
    "\n",
    " 2. Rotary embeddings\n",
    "\n",
    " 3. SwiGLU activation function\n",
    "\n",
    "We’ll incorporate each of these modifications one by one into our base model, iterating and building upon them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RMSNorm for pre-normalization:\n",
    "\n",
    "We are defining an RMSNorm function with the following functionalities:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RMSNorm(nn.Module):\n",
    "    def __init__(self, layer_shape, eps=1e-8, bias=False):\n",
    "        super(RMSNorm, self).__init__()\n",
    "\n",
    "        # Registering a learnable parameter 'scale' as a parameter of the module\n",
    "        self.register_parameter(\"scale\", nn.Parameter(torch.ones(layer_shape)))\n",
    "\n",
    "    def forward(self, x):\n",
    "        \"\"\"\n",
    "        Assumes shape is (batch, seq_len, d_model)\n",
    "        \"\"\"\n",
    "        # Calculating the Frobenius norm, RMS = 1/sqrt(N) * Frobenius norm\n",
    "        ff_rms = torch.linalg.norm(x, dim=(1,2)) * x[0].numel() ** -.5\n",
    "\n",
    "        # Normalizing the input tensor 'x' with respect to RMS\n",
    "        raw = x / ff_rms.unsqueeze(-1).unsqueeze(-1)\n",
    "\n",
    "        # Scaling the normalized tensor using the learnable parameter 'scale'\n",
    "        return self.scale[:x.shape[1], :].unsqueeze(0) * raw"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "we define the RMSNorm class. During initialization, it registers a scale parameter. In the forward pass, it calculates the **Frobenius norm** of the input tensor and then normalizes the tensor. Finally, the tensor is scaled by the registered scale parameter. This function is designed for use in LLaMA to replace the LayerNorm operation.\n",
    "\n",
    "Now it’s time to incorporate the first implementation concept of LLaMA, which is RMSNorm, into our simple NN model. Here’s the updated code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the SimpleModel_RMS with RMSNorm\n",
    "class SimpleModel_RMS(nn.Module):\n",
    "    def __init__(self, config):\n",
    "        super().__init__()\n",
    "        self.config = config\n",
    "\n",
    "        # Embedding layer to convert character indices to vectors\n",
    "        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])\n",
    "\n",
    "        # RMSNorm layer for pre-normalization\n",
    "        self.rms = RMSNorm((config['context_window'], config['d_model']))\n",
    "\n",
    "        # Linear layers for modeling relationships between features\n",
    "        self.linear = nn.Sequential(\n",
    "            nn.Linear(config['d_model'], config['d_model']),\n",
    "            nn.ReLU(),  # Currently using ReLU, will be replaced with SwiGLU as in LLaMA\n",
    "            nn.Linear(config['d_model'], config['vocab_size']),\n",
    "        )\n",
    "\n",
    "        # Print the total number of model parameters\n",
    "        print(\"Model parameters:\", sum([m.numel() for m in self.parameters()]))\n",
    "\n",
    "    def forward(self, idx, targets=None):\n",
    "        # Embedding layer converts character indices to vectors\n",
    "        x = self.embedding(idx)\n",
    "\n",
    "        # RMSNorm pre-normalization\n",
    "        x = self.rms(x)\n",
    "\n",
    "        # Linear layers for modeling relationships between features\n",
    "        logits = self.linear(x)\n",
    "\n",
    "        # If targets are provided, calculate and return the cross-entropy loss\n",
    "        if targets is not None:\n",
    "            # Reshape logits and targets for cross-entropy calculation\n",
    "            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))\n",
    "            return logits, loss\n",
    "\n",
    "        # If targets are not provided, return the logits\n",
    "        else:\n",
    "            return logits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s execute the modified NN model with RMSNorm and observe the updated number of parameters in the model, along with the loss:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an instance of SimpleModel_RMS\n",
    "model = SimpleModel_RMS(MASTER_CONFIG)\n",
    "\n",
    "# Obtain batches for training\n",
    "xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])\n",
    "\n",
    "# Calculate logits and loss using the model\n",
    "logits, loss = model(xs, ys)\n",
    "\n",
    "# Define the Adam optimizer for model parameters\n",
    "optimizer = torch.optim.Adam(model.parameters())\n",
    "\n",
    "# Train the model\n",
    "train(model, optimizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The validation loss experiences a small decrease, and the parameters of our updated LLM now total approximately 35,000."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rotary Embeddings:\n",
    "\n",
    "Next, we will implement rotary positional embeddings. In RoPE, the authors suggest embedding the position of a token in a sequence by rotating the embedding, applying a different rotation at each position. Let’s create a function that mimics the actual paper implementation of RoPE:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_rotary_matrix(context_window, embedding_dim):\n",
    "    # Initialize a tensor for the rotary matrix with zeros\n",
    "    R = torch.zeros((context_window, embedding_dim, embedding_dim), requires_grad=False)\n",
    "    \n",
    "    # Loop through each position in the context window\n",
    "    for position in range(context_window):\n",
    "        # Loop through each dimension in the embedding\n",
    "        for i in range(embedding_dim // 2):\n",
    "            # Calculate the rotation angle (theta) based on the position and embedding dimension\n",
    "            theta = 10000. ** (-2. * (i - 1) / embedding_dim)\n",
    "            # Calculate the rotated matrix elements using sine and cosine functions\n",
    "            m_theta = position * theta\n",
    "            R[position, 2 * i, 2 * i] = np.cos(m_theta)\n",
    "            R[position, 2 * i, 2 * i + 1] = -np.sin(m_theta)\n",
    "            R[position, 2 * i + 1, 2 * i] = np.sin(m_theta)\n",
    "            R[position, 2 * i + 1, 2 * i + 1] = np.cos(m_theta)\n",
    "    return R"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "we generate a rotary matrix based on the specified context window and embedding dimension, following the proposed RoPE implementation.\n",
    "\n",
    "As you may be familiar with the architecture of transformers, which involves attention heads, we similarly need to create attention heads when replicating LLaMA. To start, let’s first create a single **masked attention head** using the get_rotary_matrix function we previously developed for rotary embeddings. **Additionally, each line is commented for clarity**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RoPEMaskedAttentionHead(nn.Module):\n",
    "    def __init__(self, config):\n",
    "        super().__init__()\n",
    "        self.config = config\n",
    "        # Linear transformation for query\n",
    "        self.w_q = nn.Linear(config['d_model'], config['d_model'], bias=False)\n",
    "        # Linear transformation for key\n",
    "        self.w_k = nn.Linear(config['d_model'], config['d_model'], bias=False)\n",
    "        # Linear transformation for value\n",
    "        self.w_v = nn.Linear(config['d_model'], config['d_model'], bias=False)\n",
    "        # Obtain rotary matrix for positional embeddings\n",
    "        self.R = get_rotary_matrix(config['context_window'], config['d_model'])\n",
    "\n",
    "    def get_rotary_matrix(context_window, embedding_dim):\n",
    "        # Initialize a tensor for the rotary matrix with zeros\n",
    "        R = torch.zeros((context_window, embedding_dim, embedding_dim), requires_grad=False)\n",
    "        \n",
    "        # Loop through each position in the context window\n",
    "        for position in range(context_window):\n",
    "            # Loop through each dimension in the embedding\n",
    "            for i in range(embedding_dim // 2):\n",
    "                # Calculate the rotation angle (theta) based on the position and embedding dimension\n",
    "                theta = 10000. ** (-2. * (i - 1) / embedding_dim)\n",
    "                # Calculate the rotated matrix elements using sine and cosine functions\n",
    "                m_theta = position * theta\n",
    "                R[position, 2 * i, 2 * i] = np.cos(m_theta)\n",
    "                R[position, 2 * i, 2 * i + 1] = -np.sin(m_theta)\n",
    "                R[position, 2 * i + 1, 2 * i] = np.sin(m_theta)\n",
    "                R[position, 2 * i + 1, 2 * i + 1] = np.cos(m_theta)\n",
    "        return R\n",
    "\n",
    "    def forward(self, x, return_attn_weights=False):\n",
    "        # x: input tensor of shape (batch, sequence length, dimension)\n",
    "\n",
    "        b, m, d = x.shape  # batch size, sequence length, dimension\n",
    "\n",
    "        # Linear transformations for Q, K, and V\n",
    "        q = self.w_q(x)\n",
    "        k = self.w_k(x)\n",
    "        v = self.w_v(x)\n",
    "\n",
    "        # Rotate Q and K using the RoPE matrix\n",
    "        q_rotated = (torch.bmm(q.transpose(0, 1), self.R[:m])).transpose(0, 1)\n",
    "        k_rotated = (torch.bmm(k.transpose(0, 1), self.R[:m])).transpose(0, 1)\n",
    "\n",
    "        # Perform scaled dot-product attention\n",
    "        activations = F.scaled_dot_product_attention(\n",
    "            q_rotated, k_rotated, v, dropout_p=0.1, is_causal=True\n",
    "        )\n",
    "\n",
    "        if return_attn_weights:\n",
    "            # Create a causal attention mask\n",
    "            attn_mask = torch.tril(torch.ones((m, m)), diagonal=0)\n",
    "            # Calculate attention weights and add causal mask\n",
    "            attn_weights = torch.bmm(q_rotated, k_rotated.transpose(1, 2)) / np.sqrt(d) + attn_mask\n",
    "            attn_weights = F.softmax(attn_weights, dim=-1)\n",
    "            return activations, attn_weights\n",
    "\n",
    "        return activations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have a single masked attention head that returns attention weights, the next step is to create a multi-Head attention mechanism."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RoPEMaskedMultiheadAttention(nn.Module):\n",
    "    def __init__(self, config):\n",
    "        super().__init__()\n",
    "        self.config = config\n",
    "        # Create a list of RoPEMaskedAttentionHead instances as attention heads\n",
    "        self.heads = nn.ModuleList([\n",
    "            RoPEMaskedAttentionHead(config) for _ in range(config['n_heads'])\n",
    "        ])\n",
    "        self.linear = nn.Linear(config['n_heads'] * config['d_model'], config['d_model'])  # Linear layer after concatenating heads\n",
    "        self.dropout = nn.Dropout(.1)  # Dropout layer\n",
    "\n",
    "    def forward(self, x):\n",
    "        # x: input tensor of shape (batch, sequence length, dimension)\n",
    "\n",
    "        # Process each attention head and concatenate the results\n",
    "        heads = [h(x) for h in self.heads]\n",
    "        x = torch.cat(heads, dim=-1)\n",
    "        \n",
    "        # Apply linear transformation to the concatenated output\n",
    "        x = self.linear(x)\n",
    "        \n",
    "        # Apply dropout\n",
    "        x = self.dropout(x)\n",
    "        return x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The original paper used 32 heads for their smaller 7b LLM variation, but due to constraints, we’ll use 8 heads for our approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update the master configuration with the number of attention heads\n",
    "MASTER_CONFIG.update({\n",
    "    'n_heads': 8,\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we’ve implemented Rotational Embedding and Multi-head Attention, let’s re-write our RMSNorm neural network model with the updated code. We’ll test its performance, compute the loss, and check the number of parameters. We’ll refer to this updated model as **“RopeModel”**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RopeModel(nn.Module):\n",
    "    def __init__(self, config):\n",
    "        super().__init__()\n",
    "        self.config = config\n",
    "\n",
    "        # Embedding layer for input tokens\n",
    "        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])\n",
    "        \n",
    "        # RMSNorm layer for pre-normalization\n",
    "        self.rms = RMSNorm((config['context_window'], config['d_model']))\n",
    "        \n",
    "        # RoPEMaskedMultiheadAttention layer\n",
    "        self.rope_attention = RoPEMaskedMultiheadAttention(config)\n",
    "\n",
    "        # Linear layer followed by ReLU activation\n",
    "        self.linear = nn.Sequential(\n",
    "            nn.Linear(config['d_model'], config['d_model']),\n",
    "            nn.ReLU(),\n",
    "        )\n",
    "\n",
    "        # Final linear layer for prediction\n",
    "        self.last_linear = nn.Linear(config['d_model'], config['vocab_size'])\n",
    "\n",
    "        print(\"model params:\", sum([m.numel() for m in self.parameters()]))\n",
    "\n",
    "    def forward(self, idx, targets=None):\n",
    "        # idx: input indices\n",
    "        x = self.embedding(idx)\n",
    "\n",
    "        # One block of attention\n",
    "        x = self.rms(x)  # RMS pre-normalization\n",
    "        x = x + self.rope_attention(x)\n",
    "\n",
    "        x = self.rms(x)  # RMS pre-normalization\n",
    "        x = x + self.linear(x)\n",
    "\n",
    "        logits = self.last_linear(x)\n",
    "\n",
    "        if targets is not None:\n",
    "            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))\n",
    "            return logits, loss\n",
    "\n",
    "        else:\n",
    "            return logits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s execute the modified NN model with RMSNorm, Rotational Embeddings and Masked Multi Head Attentions to observe the updated number of parameters in the model, along with the loss:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an instance of RopeModel (RMSNorm, RoPE, Multi-Head)\n",
    "model = RopeModel(MASTER_CONFIG)\n",
    "\n",
    "# Obtain batches for training\n",
    "xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])\n",
    "\n",
    "# Calculate logits and loss using the model\n",
    "logits, loss = model(xs, ys)\n",
    "\n",
    "# Define the Adam optimizer for model parameters\n",
    "optimizer = torch.optim.Adam(model.parameters())\n",
    "\n",
    "# Train the model\n",
    "train(model, optimizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/20664/1*6MiNSfJf8x8xZqWkiy51Tg.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "\n",
    "The validation loss experiences a small decrease again, and the parameters of our updated LLM now total approximately 55,000.\n",
    "\n",
    "Let’s train the model for more epochs to see if the loss of our recreated LLaMA LLM continues to decrease or not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Updating training configuration with more epochs and a logging interval\n",
    "MASTER_CONFIG.update({\n",
    "    \"epochs\": 5000,\n",
    "    \"log_interval\": 10,\n",
    "})\n",
    "\n",
    "# Training the model with the updated configuration\n",
    "train(model, optimizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/20664/1*JxlI2_wH8OHDrizFoJF6Kw.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "\n",
    "The validation loss continues to decrease, suggesting that training for more epochs could lead to further loss reduction, though not significantly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SwiGLU activation function:\n",
    "\n",
    "As mentioned before, the creators of LLaMA use SwiGLU instead of ReLU, so we’ll be implementing SwiGLU equation in our code.\n",
    "\n",
    "![[https://arxiv.org/pdf/2002.05202v1.pdf](https://arxiv.org/pdf/2002.05202v1.pdf)](https://cdn-images-1.medium.com/max/27072/1*db6BeMw78FH_ZkVEGIyVPA.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SwiGLU(nn.Module):\n",
    "    \"\"\"\n",
    "    Swish-Gated Linear Unit\n",
    "    https://arxiv.org/pdf/2002.05202v1.pdf\n",
    "    \"\"\"\n",
    "    def __init__(self, size):\n",
    "        super().__init__()\n",
    "        self.linear_gate = nn.Linear(size, size)\n",
    "        self.linear = nn.Linear(size, size)\n",
    "        self.beta = torch.randn(1, requires_grad=True)\n",
    "\n",
    "        self.beta = nn.Parameter(torch.ones(1))\n",
    "        self.register_parameter(\"beta\", self.beta)\n",
    "\n",
    "    def forward(self, x): \n",
    "        swish_gate = self.linear_gate(x) * torch.sigmoid(self.beta * self.linear_gate(x))\n",
    "        out = swish_gate * self.linear(x)\n",
    "        return out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After implementing the SwiGLU equation in python, we need to integrate it into our modified LLaMA language model (**RopeModel**)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RopeModel(nn.Module):\n",
    "    def __init__(self, config):\n",
    "        super().__init__()\n",
    "        self.config = config\n",
    "\n",
    "        # Embedding layer for input tokens\n",
    "        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])\n",
    "        \n",
    "        # RMSNorm layer for pre-normalization\n",
    "        self.rms = RMSNorm((config['context_window'], config['d_model']))\n",
    "        \n",
    "        # Multi-head attention layer with RoPE (Rotary Positional Embeddings)\n",
    "        self.rope_attention = RoPEMaskedMultiheadAttention(config)\n",
    "\n",
    "        # Linear layer followed by SwiGLU activation\n",
    "        self.linear = nn.Sequential(\n",
    "            nn.Linear(config['d_model'], config['d_model']),\n",
    "            SwiGLU(config['d_model']),  # Adding SwiGLU activation\n",
    "        )\n",
    "\n",
    "        # Output linear layer\n",
    "        self.last_linear = nn.Linear(config['d_model'], config['vocab_size'])\n",
    "\n",
    "        # Printing total model parameters\n",
    "        print(\"model params:\", sum([m.numel() for m in self.parameters()]))\n",
    "\n",
    "    def forward(self, idx, targets=None):\n",
    "        x = self.embedding(idx)\n",
    "\n",
    "        # One block of attention\n",
    "        x = self.rms(x)  # RMS pre-normalization\n",
    "        x = x + self.rope_attention(x)\n",
    "\n",
    "        x = self.rms(x)  # RMS pre-normalization\n",
    "        x = x + self.linear(x)  # Applying SwiGLU activation\n",
    "\n",
    "        logits = self.last_linear(x)\n",
    "\n",
    "        if targets is not None:\n",
    "            # Calculate cross-entropy loss if targets are provided\n",
    "            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))\n",
    "            return logits, loss\n",
    "\n",
    "        else:\n",
    "            return logits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s execute the modified NN model with RMSNorm, Rotational Embeddings, Masked Multi Head Attentions and SwiGLU to observe the updated number of parameters in the model, along with the loss:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an instance of RopeModel (RMSNorm, RoPE, Multi-Head, SwiGLU)\n",
    "model = RopeModel(MASTER_CONFIG)\n",
    "\n",
    "# Obtain batches for training\n",
    "xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])\n",
    "\n",
    "# Calculate logits and loss using the model\n",
    "logits, loss = model(xs, ys)\n",
    "\n",
    "# Define the Adam optimizer for model parameters\n",
    "optimizer = torch.optim.Adam(model.parameters())\n",
    "\n",
    "# Train the model\n",
    "train(model, optimizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/23288/1*P3Z15lV1LNLZdsEpTd2mSA.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "\n",
    "Once again the validation loss experiences a small decrease, and the parameters of our updated LLM now total approximately 60,000.\n",
    "\n",
    "So far, we have successfully implemented the key components of the paper, namely RMSNorm, RoPE, and SwiGLU. We observed that these implementations led to a minimal decrease in the loss.\n",
    "\n",
    "Now we will add layers to our LLaMA to examine its impact on the loss. The original paper used 32 layers for the 7b version, but we will use only 4 layers. Let’s adjust our model settings accordingly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update model configurations for the number of layers\n",
    "MASTER_CONFIG.update({\n",
    "    'n_layers': 4,  # Set the number of layers to 4\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s start by creating a single layer to understand its impact."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# add RMSNorm and residual conncection\n",
    "class LlamaBlock(nn.Module):\n",
    "    def __init__(self, config):\n",
    "        super().__init__()\n",
    "        self.config = config\n",
    "\n",
    "        self.rms = RMSNorm((config['context_window'], config['d_model']))\n",
    "        \n",
    "        self.attention = RoPEMaskedMultiheadAttention(config)\n",
    "        self.feedforward = nn.Sequential(\n",
    "            nn.Linear(config['d_model'], config['d_model']),\n",
    "            SwiGLU(config['d_model']),\n",
    "        )\n",
    "\n",
    "    def forward(self, x):\n",
    "        x = self.rms(x) # rms pre-normalization\n",
    "        x = x + self.attention(x)\n",
    "\n",
    "        x = self.rms(x) # rms pre-normalization\n",
    "        x = x + self.feedforward(x)\n",
    "        return x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create an instance of the LlamaBlock class and applies it to a random tensor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MASTER_CONFIG"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "block = LlamaBlock(MASTER_CONFIG)\n",
    "block(torch.randn(MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'], MASTER_CONFIG['d_model']));"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an instance of the LlamaBlock class with the provided configuration\n",
    "block = LlamaBlock(MASTER_CONFIG)\n",
    "\n",
    "# Generate a random tensor with the specified batch size, context window, and model dimension\n",
    "random_input = torch.randn(MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'], MASTER_CONFIG['d_model'])\n",
    "\n",
    "# Apply the LlamaBlock to the random input tensor\n",
    "output = block(random_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Having successfully created a single layer, we can now use it to construct multiple layers. Additionally, we will rename our model class from **“ropemodel”** to **“Llama”** as we have replicated every component of the LLaMA language model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import OrderedDict\n",
    "\n",
    "MASTER_CONFIG.update({\n",
    "    'n_layers': 4,\n",
    "})\n",
    "\n",
    "class Llama(nn.Module):\n",
    "    def __init__(self, config):\n",
    "        super().__init__()\n",
    "        self.config = config\n",
    "        # Embedding layer for token representations\n",
    "        self.embeddings = nn.Embedding(config['vocab_size'], config['d_model'])\n",
    "        # Sequential block of LlamaBlocks based on the specified number of layers\n",
    "        self.llama_blocks = nn.Sequential(\n",
    "            OrderedDict([(f\"llama_{i}\", LlamaBlock(config)) for i in range(config['n_layers'])])\n",
    "        )\n",
    "        # Feedforward network (FFN) for final output\n",
    "        self.ffn = nn.Sequential(\n",
    "            nn.Linear(config['d_model'], config['d_model']),\n",
    "            SwiGLU(config['d_model']),\n",
    "            nn.Linear(config['d_model'], config['vocab_size']),\n",
    "        )\n",
    "\n",
    "        # Print total number of parameters in the model\n",
    "        print(\"model params:\", sum([m.numel() for m in self.parameters()]))\n",
    "\n",
    "    def forward(self, idx, targets=None):\n",
    "        # Input token indices are passed through the embedding layer\n",
    "        x = self.embeddings(idx)\n",
    "        # Process the input through the LlamaBlocks\n",
    "        x = self.llama_blocks(x)\n",
    "        # Pass the processed input through the final FFN for output logits\n",
    "        logits = self.ffn(x)\n",
    "\n",
    "        # If targets are not provided, return only the logits\n",
    "        if targets is None:\n",
    "            return logits\n",
    "        # If targets are provided, compute and return the cross-entropy loss\n",
    "        else:\n",
    "            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))\n",
    "            return logits, loss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s execute the modified LLaMA model with RMSNorm, Rotational Embeddings, Masked Multi Head Attentions, SwiGLU and N_layers to observe the updated number of parameters in the model, along with the loss:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an instance of RopeModel (RMSNorm, RoPE, Multi-Head, SwiGLU, N_layers)\n",
    "llama = Llama(MASTER_CONFIG)\n",
    "\n",
    "# Obtain batches for training\n",
    "xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])\n",
    "\n",
    "# Calculate logits and loss using the model\n",
    "logits, loss = llama(xs, ys)\n",
    "\n",
    "# Define the Adam optimizer for model parameters\n",
    "optimizer = torch.optim.Adam(llama.parameters())\n",
    "\n",
    "# Train the model\n",
    "train(llama, optimizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/26104/1*hNJ7JxurShESQj6E5fsNrg.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "\n",
    "While there’s a possibility of overfitting, it’s crucial to explore whether extending the number of epochs leads to a further reduction in loss. Additionally, note that our current LLM has over 2 million parameters.\n",
    "\n",
    "Let’s train it for higher number of epochs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update the number of epochs in the configuration\n",
    "MASTER_CONFIG.update({\n",
    "    'epochs': 10000,\n",
    "})\n",
    "# Train the LLaMA model for the specified number of epochs\n",
    "train(llama, optimizer, scheduler=None, config=MASTER_CONFIG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/26104/1*qKCUQza7EbFPYIO0IJd6ew.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "\n",
    "The loss here is 1.08, we can achieve even more lower loss without encountering significant overfitting. This suggests the model is performing well.\n",
    "\n",
    "Let’s train the model once more, this time incorporating a scheduler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Training the model again, scheduler for better optimization.\n",
    "train(llama, optimizer, config=MASTER_CONFIG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/26104/1*ov1mEWSeLXz8NACkErQZbA.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "\n",
    "Up until now, we’ve successfully implemented a scaled-down version of the LLaMA architecture on our custom dataset. Now, let’s examine the generated output from our 2 million-parameter Language Model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate text using the trained LLM (llama) with a maximum of 500 tokens\n",
    "generated_text = generate(llama, MASTER_CONFIG, 500)[0]\n",
    "print(generated_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://cdn-images-1.medium.com/max/31844/1*NRnC0gQnbdCFkYaROiVbtw.png\" alt=\"\" width=\"50%\">\n",
    "\n",
    "\n",
    "Even though some generated words may not be perfect English, our LLM with just 2 million parameters has shown a basic understanding of the English language.\n",
    "\n",
    "Now, let’s see how well our model performs on the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get batches from the test set\n",
    "xs, ys = get_batches(dataset, 'test', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])\n",
    "\n",
    "# Pass the test data through the LLaMA model\n",
    "logits, loss = llama(xs, ys)\n",
    "\n",
    "# Print the loss on the test set\n",
    "print(loss)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The computed loss on the test set is approximately 1.236.\n",
    "\n",
    "A simple way to check for changes in the generated output is to run training for a large number of epochs and observe the results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Experimenting with hyperparameters\n",
    "\n",
    "Hyperparameter tuning is a crucial step in training neural networks. In the original Llama paper, the authors utilized the Cosine Annealing learning schedule. However, in our experimentation, it didn’t perform well. Here’s an example of experimenting with hyperparameters using a different learning schedule:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update configuration\n",
    "MASTER_CONFIG.update({\n",
    "    \"epochs\": 1000\n",
    "})\n",
    "\n",
    "# Create Llama model with Cosine Annealing learning schedule\n",
    "llama_with_cosine = Llama(MASTER_CONFIG)\n",
    "\n",
    "# Define Adam optimizer with specific hyperparameters\n",
    "llama_optimizer = torch.optim.Adam(\n",
    "    llama.parameters(),\n",
    "    betas=(9, .95),\n",
    "    weight_decay=.1,\n",
    "    eps=1e-9,\n",
    "    lr=1e-3\n",
    ")\n",
    "\n",
    "# Define Cosine Annealing learning rate scheduler\n",
    "scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(llama_optimizer, 300, eta_min=1e-5)\n",
    "\n",
    "# Train the Llama model with the specified optimizer and scheduler\n",
    "train(llama_with_cosine, llama_optimizer, scheduler=scheduler)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "In this notebook, we’ve walked through a step-by-step process on how to implement the LLaMA approach to build your own small Language Model (LLM). As a suggestion, consider expanding your model to around 15 million parameters, as smaller models in the range of 10M to 20M tend to comprehend English better. Once your LLM becomes proficient in language, you can fine-tune it for specific use cases.\n",
    "\n",
    "If you find this notebook helpful, there's no \"like\" option, but you can connect with me on [LinkedIn](https://www.linkedin.com/in/fareed-khan-dev/) or follow me on [Medium](https://medium.com/@fareedkhandev)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cloudspace",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}