AI/ML/Deep Learning Cheat Sheet

Nov 19, 2024 · Christopher Coverdale · 11 min read

Cheat Sheet

Glossary

  • Dependent Variable - The variable a Neural Network is trained to predict.

  • Independent Variable - Variables that are used as training input to predict the Dependent Variable.

  • Continuous Variables - Variables that are not discrete and can take any value in a range between two points. They can be natively used in multiplication e.g. price, age, height, weight.

  • Categorical Variables - Discrete variables used in classifying or labelling an object. They are not easily used in multiplication or have no meaning in multiplication e.g. a color, a random user id, a country name.

  • Embedding - An Embedding is a collection of numbers used to represent a Categorical Variable. Since Categorical Variables do not have an inherent numerical value, Embeddings are used to represent them numerically in the training process.

  • Activation (Function) - A function that introduces non-linearity into a model. It transforms the output of one layer into input consumable by the next layer.
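
For example, ReLU is a common activation function; a quick illustration (not part of the original glossary):

x = torch.tensor([-1.0, 0.5, 2.0])
torch.relu(x)  # tensor([0.0000, 0.5000, 2.0000]) - negative values become 0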

  • Parameters - The weights + bias of a model: randomly initialized as the initial input, then adjusted after calculating and applying the gradients.

  • baseline - The simplest or most naive approach used to prove a model, providing a starting point to build from.

  • tensor - a matrix or mathematical object that represents multi-dimensional data

  • rank0-tensor - a scalar e.g. a

  • rank1-tensor - a vector e.g [a, b]

  • rank2-tensor - a matrix (rows and columns) e.g.

[[a,b], [c, d]]
  • rank3-tensor - a 3 dimensional matrix (a list of matrices) e.g.
[
[[a, b], [c, d]], # Matrix 1
[[e, f], [g, h]]  # Matrix 2
]
  • the shape of a tensor - defines the number of dimensions of a tensor and the size of each dimension

  • SGD (Stochastic Gradient Descent) - Calculates the gradients of the loss function with respect to the parameters and adjusts the parameters in the opposite direction of the gradients, scaled by a learning rate.
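
A minimal sketch of one SGD update, assuming a model whose gradients have already been computed with loss.backward():

lr = 0.01  # learning rate
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad   # step in the opposite direction of the gradient
        p.grad.zero_()     # reset the gradient for the next iteration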

  • transformer - An attention-based neural net architecture, originally developed for NLP

  • RNN (recurrent neural network) - Used for NLP but being replaced by transformer-based AI and LLMs. It excels at processing sequential data and uses a “memory” in a hidden layer to remember previous inputs. This is useful when making predictions, e.g. given the prompt “Apple is…”, it can recall from its memory the highest-probability continuation, “Apple is red”.

  • Feed Forward Neural Network - The main neural network we’ve been working on. It’s a “simpler” neural network that simply passes information forward to the next layer, used for things like simple classification. This neural network doesn’t have a “memory”, so it forgets what the previous input was once it passes it to the next layer.

  • Convolutional Neural Network - Mainly used for understanding spatial information e.g. videos and images

  • temperature - Used in NLP models (and other generative models) to control the level of randomness. A low temperature biases the model towards choosing the most probable text; a higher temperature gives less probable (more random) text a greater chance of being chosen.
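
A rough sketch of how temperature is typically applied to a model's logits before sampling (logits and temperature are assumed variables):

probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)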

  • residual connection - A strategy to mitigate vanishing gradients. It adds the output of a function to its original input. This stabilizes gradient flow during backpropagation and allows the network to learn identity mappings. It is used in transformers when passing the output of the multi-head attention together with the original input to the Add + Norm layer.

y = F(x) + x
  • auto regression - models that predict the next item in a series based only on knowledge of the previous items (tokens)

  • weight tying - A performance improvement strategy that combines (reuses) the same weights for the input embeddings and the final output layer (LM Head). This reduces the number of parameters and often improves generalization
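
A minimal sketch of weight tying, assuming vocab_size and d_model are defined elsewhere:

embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embedding.weight  # the output projection reuses the embedding weights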

Pytorch

Define a tensor

tensor1 = torch.tensor([1, 2, 3])

Stack

torch.stack creates a new tensor by stacking the given tensors along a new dimension.

For example, the code below takes two rank-1 tensors and stacks them to create a rank-2 tensor.

import torch

# Define two tensors
tensor1 = torch.tensor([1, 2, 3])
tensor2 = torch.tensor([4, 5, 6])

# Stack along a new dimension (rows)
result = torch.stack((tensor1, tensor2), dim=0)

print(result)

>>>
tensor([[1, 2, 3],
        [4, 5, 6]])

Dataset

An abstract class in PyTorch that defines how data from a dataset can be retrieved using __getitem__ and __len__.

When subclassing Dataset, you are expected to implement __getitem__ and __len__.

Below is an example:

from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self, data, input_vocab, target_vocab, input_tokenizer, target_tokenizer):
        self.data = data
        self.input_vocab = input_vocab
        self.target_vocab = target_vocab
        self.input_tokenizer = input_tokenizer
        self.target_tokenizer = target_tokenizer

        for input_txt, target_txt in data:
            self.input_vocab.add_sentence(self.input_tokenizer(input_txt))
            self.target_vocab.add_sentence(self.target_tokenizer(target_txt))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        input_txt, target_txt = self.data[idx]
        input_tokens = self.input_tokenizer(input_txt)
        target_tokens = self.target_tokenizer(target_txt)
        input_indices = self.input_vocab.numericalize(input_tokens)
        target_indices = self.target_vocab.numericalize(target_tokens)

        return input_indices, target_indices

input_vocab = Vocabulary()
target_vocab = Vocabulary()

dataset = SimpleDataset(data, input_vocab, target_vocab, nlp_en, nlp_pt)
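# Note: Vocabulary, data and the tokenizers (nlp_en, nlp_pt) are assumed to be defined elsewhere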

In this example, when the subclassed Dataset is initialized, it builds the vocabulary using the data and the tokenizer.

The length of the dataset can be retrieved using __len__ and the loaded dataset can be retrieved with the implementation of __getitem__.
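
For illustration, the dataset defined above could then be used like this:

print(len(dataset))                         # calls __len__
input_indices, target_indices = dataset[0]  # calls __getitem__ for the first example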

DataLoader

The DataLoader class from PyTorch wraps the Dataset and helps with the following:

  • loads the data in batches for training
  • shuffles the data
  • loads the data in parallel using multiple workers
  • pins memory for faster transfer to the GPU
  • allows a collate function to be passed to handle grouping/padding of the dataset (see the sketch below)
data_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
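
A minimal collate_fn sketch for padding variable-length sequences, assuming each dataset item is an (input_indices, target_indices) pair of token id lists as in the Dataset example above:

from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    inputs, targets = zip(*batch)
    inputs = [torch.tensor(seq) for seq in inputs]
    targets = [torch.tensor(seq) for seq in targets]
    # Pad every sequence in the batch to the length of the longest one
    inputs = pad_sequence(inputs, batch_first=True, padding_value=0)
    targets = pad_sequence(targets, batch_first=True, padding_value=0)
    return inputs, targets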

nn.Embedding

Used to create embeddings for categorical variables or any representation of a non-continuous variable.

  • Creates x number of embeddings by y number of dimensions
torch.nn.Embedding(num_embeddings, embedding_dim)

Example:

embedding = nn.Embedding(10, 3)
input = torch.tensor([1, 2, 3, 4, 8])
output = embedding(input)
output

>>>
tensor([[ 0.5843,  1.4902,  0.3140],
        [ 0.5584, -0.3696, -0.5838],
        [-0.3950, -0.2971,  0.1814],
        [-1.4518,  0.0082,  0.0437],
        [-0.6157,  1.0340, -0.5854]], grad_fn=<EmbeddingBackward0>)

nn.Linear

A fully connected layer that applies a linear transformation.

y = xW^T + b
  • x is the input
  • W is the weight matrix
  • b is the bias vector
nn.Linear(in_features, out_features, bias=True)
  • in_features: number of input features e.g. 128
  • out_features: number of output features e.g. 64
  • bias: adds a learnable bias, defaults to True
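
An illustrative usage (the shapes are just an example):

layer = nn.Linear(128, 64)
x = torch.randn(32, 128)  # batch of 32 samples with 128 features each
y = layer(x)              # y.shape == torch.Size([32, 64])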

unsqueeze

unsqueeze is a function that adds a new dimension to a tensor at a given position.

  • This adds a new dimension to the tensor at position 0
pe = pe.unsqueeze(0)
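
For example (illustrative shapes):

x = torch.ones(4, 10)   # shape: [4, 10]
x = x.unsqueeze(0)      # shape: [1, 4, 10]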

triu (triangle upper)

Returns a copy of the tensor where all elements below the diagonal line are set to 0.

This is useful when creating look ahead masks in Transformers.

a = torch.ones((4, 4))
a
>>>

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])
  • diagonal is the offset of the diagonal to start from; diagonal=1 also zeroes the main diagonal, keeping only the elements strictly above it
torch.triu(a, diagonal=1)

>>>
tensor([[0., 1., 1., 1.],
        [0., 0., 1., 1.],
        [0., 0., 0., 1.],
        [0., 0., 0., 0.]])
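
As a rough sketch, a look-ahead mask for a sequence of length 4 could be built like this (illustrative only):

seq_len = 4
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
# mask is True above the diagonal, i.e. at "future" positions,
# so those attention scores can be set to -inf before the softmax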

nn.ModuleList

Stacks layers in a Neural Network. It’s different from nn.Sequential: Sequential will auto-execute each layer in order, whereas ModuleList requires an explicit loop, which allows custom logic between each layer. The example below wraps the layers in a small (hypothetical) module to show the pattern.

import torch.nn as nn

class SimpleNet(nn.Module):  # hypothetical module name for illustration
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        ])

    def forward(self, x):
        for layer in self.layers:
            # Custom logic (e.g. residuals where shapes match) can go here
            x = layer(x)
        return x

view

A function that creates a new view into a tensor and also changes the shape of the view. It does not change the data.

The example below creates a view of 2 rows of 3 columns into 3 rows of 2 columns.

x = torch.tensor([
    [1, 2, 3],
    [4, 5, 6]
])  # shape: [2, 3]
print(x.shape)

# Creates a new view of the original data, splitting the 2 rows of 3, into 3 rows of 2
# → shape: [3, 2]
# → tensor([[1, 2], [3, 4], [5, 6]])
x.view(3, 2)

>>>
torch.Size([2, 3])
tensor([[1, 2],
        [3, 4],
        [5, 6]])

Pandas

Loading a csv

df = pd.read_csv("./decision_tree_sample.csv")

Writing a data frame to csv

df.to_csv('output.csv')
  • If you don’t need the indexes of the rows
df.to_csv('output.csv', index=False)

DataFrame Columns

df.columns

Iterating through column values

for column in df.columns:
    for value in df[column]:
        print(value)

Iterating over rows

for idx, row in df.iterrows():
    print(row)

Mutation through a Vectorized Approach

The Vectorized Approach is a more efficient pattern for manipulating row entries: it processes a whole column at a time instead of looping through each row and doing calculations and updates per row.

The example below is not the best example, since it could be handled more efficiently using built-in pandas functions, but it demonstrates the vectorized approach for updates.

It calculates all the lag values for each column and then updates the whole column after the calculation.

def add_lag_features(df, num_weeks):
    # Create the lag feature columns in the dataframe.
    for i in range(1, 4):
        if f"LAG_{i}" not in df.columns:
            df[f"LAG_{i}"] = 0

    buckets = setup_buckets(df, num_weeks)

    # Add lag features using vectorized operations
    # LAG_1, LAG_2, LAG_3
    lag_nums = [1, 2, 3]
    for num in lag_nums:
        lag_col = f"LAG_{num}"
        lag_values = []

        for idx, row in df.iterrows():
            week = row["WEEK"]
            lag_idx = week - num

            if lag_idx > 0:  # Valid lag index
                name = row["NAME"]
                lag_quantity = buckets.get(lag_idx, {}).get(name, {}).get("UNITS", 0)
            else:
                lag_quantity = 0  # Default value for invalid lag

            lag_values.append(lag_quantity)

        # NOTE!!! - IMPORTANT PART HERE
        df[lag_col] = lag_values  # Assign the calculated lag values

    return df

Drop Columns

df = df.drop(columns=["COLUMN_A", "COLUMN_B"])

Get row by index

df.iloc[idx]

Filtering rows based on column value

  • We can get a view to a subset of the data frame by filtering
week_1 = df.loc[df["WEEK"] == 1]

Splitting data frames

  • We can split dataframes on a condition
condition = df["Age"] < 50
sample_split = df[condition]

Boolean Mask Condition

  • We can also use a boolean mask to inverse a condition.

  • This example splits the data frame into two subsets using a boolean mask

condition = df["Age"] < 50

group_1 = df[condition]
group_2 = df[~condition]

Getting the last item

Using tail, to get the last item in a data frame

last_item = df.tail(1)

Update all rows in a column

  • This will also create a new column and assign the value to all rows
df["WEEK"] = new_week

Mutate the original dataframe

  • loc tells pandas to mutate the original data frame NOT a copy
  • apply, applies a lambda function for every row in the column
df.loc[:, "WEEK"] = df["WEEK"].apply(lambda x: num_weeks + 1 - x)

Reset indexes

  • Sometimes when combining data frames, there might be duplicate indexes. Resetting it prevents any collisions.
new_df = pd.DataFrame(rows).reset_index(drop=True)

Mapping values to a Column

  • This approach maps new values in a column according to the values of an existing column

  • The example below has a list of dates, each corresponding to a value in the WEEK column, e.g. [15, 14, 13, … 1]

  • This creates a dictionary of the WEEK to the dates and then creates a new column using the mapped values

df = pd.read_csv("./example.csv")
dates = [
    "Jan 14, 2025",
    "Jan 07, 2025",
    "Dec 31, 2024",
    "Dec 23, 2024",
    "Dec 17, 2024",
    "Dec 11, 2024",
    "Dec 04, 2024",
    "Nov 26, 2024",
    "Nov 19, 2024",
    "Nov 14, 2024",
    "Nov 12, 2024",
    "Nov 04, 2024",
    "Oct 31, 2024",
    "Oct 28, 2024",
    "Oct 23, 2024",
]

# Map the dates to the weeks
unique_weeks = sorted(df["WEEK"].unique(), reverse=True)
week_date_mapping = dict(zip(unique_weeks, dates))
df["DATE"] = df["WEEK"].map(week_date_mapping)

groupby and transform

  • groupby will group the rows in the dataframe that share the same value in the column
import pandas as pd

df = pd.DataFrame({
    'NAME': ['A', 'A', 'B', 'B', 'C', 'C'],
    'WEEK': [1, 2, 1, 2, 1, 2],
    'UNITS': [10, 15, 20, 25, 30, 35]
})

grouped = df.groupby("NAME")
  • The groups can be viewed:
for name, group in grouped:
    print(f"Group: {name}")
    print(group)
    print("---")

>>>

Group: A
  NAME  WEEK  UNITS
0    A     1     10
1    A     2     15
---
Group: B
  NAME  WEEK  UNITS
2    B     1     20
3    B     2     25
---
Group: C
  NAME  WEEK  UNITS
4    C     1     30
5    C     2     35
  • Aggregates can be applied to the groups using the columns:
df.groupby("NAME")["UNITS"].sum()

>>>

NAME
A    25
B    45
C    65
  • transform will keep the original DataFrame shape but apply a function based on the group

  • This groups the dataframe by NAME and transforms the original dataframe based on the UNITS.

  • This will calculate the rolling mean for each item based on NAME using the UNITS for the rolling mean calculation

df["rolling_mean_2"] = df.groupby("NAME")["UNITS"].transform(lambda x: x.rolling(2).mean())

Rolling Mean

  • Below is an example of calculating a rolling mean with a window of size 4

  • x.shift(1): shifts the values by one so each row sees the previous entry

  • .rolling(4): creates a rolling window over the previous 4 items

  • .mean(): get the mean of the window

Any items that cannot be calculated are given NaN, fillna(0, inplace=True) will replace NaN with 0

df["ROLLING_MEAN_4"] = self.df.groupby("NAME")["UNITS"].transform(
    lambda x: x.shift(1).rolling(4).mean()
)

df.fillna(0, inplace=True)

Data Science

Rolling Mean/Average

  • Useful in time series data

  • A Rolling Mean/Average represents a rolling average of the previous values

  • Usually implemented with a certain window, e.g. the rolling average of the previous 4 data points

  • It identifies longer-term patterns or cycles

  • Especially useful in noisy data where random variations can obscure the underlying long term pattern

The rolling mean (or moving average) at time t with a window size n is calculated as:

\text{Rolling Mean}_t = \frac{1}{n} \sum_{i=t-n+1}^{t} y_i

  • Rolling Mean_t is the rolling mean at time t

  • n is the window size

  • y_i represents the data point at time i

  • Essentially, y_{t-n+1} ... y_t are summed and then divided by n
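
For example, with a window size n = 3 and data points y = [2, 4, 6, 8], the rolling mean at t = 4 is (4 + 6 + 8) / 3 = 6.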
