What is the definition of artificial intelligence according to Kurzweil, 1990?
The art of creating machines that perform functions that require intelligence when performed by people
What is Computational Intelligence according to Poole et al., 1998?
The study of the design of intelligent agents
What does Nilsson, 1998 say about AI?
AI is concerned with intelligent behaviour in artifacts
What is the focus of Charniak and McDermott, 1985 regarding AI?
The study of mental faculties through computational models
What is Winston, 1992's perspective on AI?
The study of computations that enable perception, reasoning, and acting
What is Haugeland, 1986's definition of AI?
The effort to make computers think like humans
What is Bellman, 1978's view on AI?
The automation of activities associated with human thinking
What approach does this course take towards AI?
The human route, programming computers to act humanly or learn from experience
What is a key application of ML mentioned?
Robotics
What is another application of ML?
Self-driving cars
What is an example of ML application in medicine?
Detecting sepsis in MRI scans
What is machine learning?
The field of machine learning is concerned with constructing computer programs that automatically improve with experience.
Who defined machine learning in 1997?
Tom Mitchell.
What is the focus of machine learning according to Tom Mitchell's 1997 definition?
A computer program is said to learn from experience E with respect to tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
What does the function 'f' calculate in the example?
Student's grades in Intro to ML.
What is 'h' in the example?
An approximation function used to estimate grades based on past data.
What are the three main categories of machine learning settings?
Supervised, Unsupervised, and Reinforcement.
What does supervised learning do?
Produces a model capable of generating correct output labels.
What is unsupervised learning?
No labels are given; algorithms find patterns in the data.
What is clustering in unsupervised learning?
Dividing data into groups based on similarities, like dogs and cats.
What is dimensionality reduction?
Identifying important features in data, like enhancing a blurry image of a face.
What is reinforcement learning?
An algorithm interacts with the environment to produce a reward signal for improvement.
What is policy search in reinforcement learning?
Finding actions for an agent to maximize received rewards based on its state.
What is semi-supervised learning?
Some data have labels, some do not; aims to label unlabelled data using labelled items.
What is weakly-supervised learning?
Inexact output labels; e.g., indicating an item is somewhere in an image without precise location.
What is classification in machine learning?
Assigning discrete or categorical variables to inputs, like predicting actions in videos.
What is binary classification?
A classification task with only 2 labels to choose from.
What is multi-class classification?
A classification task with more than 2 labels to choose from.
What is multi-label classification?
A classification task where multiple labels can be correct for a single input.
What is regression in machine learning?
Assigning a real/continuous float value to an input.
What is simple regression?
1 input variable and 1 output variable. E.g., size of a house predicts its price.
What is multiple regression?
Multiple input variables and 1 output variable. E.g., grade calculator with 3 inputs and 1 output (grade).
What is multivariate regression?
Multiple inputs to predict multiple outputs. E.g., predicting the location of an umbrella from a picture.
What is an example regression problem?
Given time as input, the regressor predicts the value at that time.
What characterizes a bad predictor in regression?
The line is far off from almost all points.
What characterizes a good predictor in regression?
The line is close to most points, even if it is off.
What characterizes a very good predictor in regression?
It predicts given points well but may struggle with unknown examples.
What is supervised learning?
Most common setting in ML problems, typically involves classification and regression.
How does Antoine classify shapes?
By placing data along 2 axes (colour and points) to create a classifier.
What is a linear classifier?
A classifier that uses a straight line to separate data into categories.
What have we learnt about data in predictions?
More data leads to more accurate predictions.
Why is selecting good features important?
Good features improve prediction accuracy; combining features is often better.
What are two ways to make predictions?
Classification (predicting a discrete label) and regression (predicting a continuous value).
What is the goal of generating a model in supervised learning?
To approximate the true function using input data to predict outputs.
What is the training dataset defined as?
A sequence of pairs of inputs and output labels (x_n, y_n).
What is feature encoding in supervised learning?
Transforming raw input observations into a modified version (feature space).
What is the purpose of the Xtest dataset?
To evaluate model performance on unseen data by comparing predicted outputs with ground truth.
What do we compute to measure model performance?
A score comparing predicted outputs with the ground truth/gold standard annotation.
What is the purpose of the truth/gold standard annotation?
To compute a score measuring model performance.
What is the first step in the complete pipeline?
Feature Encoding
Why is it important to examine data before designing an algorithm?
It can provide clues for classifier design and help identify class label distribution.
What happens if class labels are imbalanced?
The algorithm may learn to identify only the majority class.
What should you do with features before starting an algorithm?
Normalize your features.
How do you normalize features?
Subtract the mean and divide by the standard deviation.
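The normalisation step above can be sketched with NumPy — a minimal example, assuming the data is a 2-D array with one example per row and one feature per column:

```python
import numpy as np

def normalize(X):
    """Z-score normalisation: subtract the mean and divide by the
    standard deviation, computed per feature (column)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

# Illustrative data: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_norm = normalize(X)
# Each column of X_norm now has mean 0 and standard deviation 1.
```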
What is the curse of dimensionality?
As dimensions increase, data becomes sparse and training data may be noisy.
What is feature selection?
Choosing a subset of original features to work with.
What is feature extraction?
Generating a new set of features from the original features.
What is the Bag of Words method in NLP?
Logging the frequency of words without tracking their positions.
What is the modern approach to feature encoding in deep learning?
Letting the algorithm figure out optimal features from raw data.
What is a lazy learner?
Stores training examples and generalizes upon explicit request at test time.
What is an eager learner?
Constructs a general description of the target function before test time.
How does an eager learner differ from a lazy learner?
It learns and generalises all it can before test time, resulting in quicker predictions at test time.
What is a Non-Parametric Model?
Assumes no fixed form; trusts the data instead of a function.
What is an example of a Non-Parametric Model?
The nearest-neighbour classifier, which is a lazy learner.
How does a nearest neighbour classifier work?
Looks at the nearest neighbour and classifies itself as the same.
What is a Linear Model?
Assumes the data is linearly separable, learning the best line to separate it.
What does a Linear Model classify?
Anything on the left as a green diamond, anything on the right as a red circle.
What is a Non-Linear Model?
Used for non-linearly separable problems with more complex models.
What is Feature Space Transformation?
Representing data differently to analyze and separate it more easily.
How do SVMs solve non-linear datasets?
Use a kernel for transformation.
How do Neural Networks handle non-linear datasets?
Try to learn how to transform the feature space automatically.
What is the Bias-Variance trade-off?
A balance between overfitting (high variance) and underfitting (high bias).
What is Occam’s razor in ML?
Choose the simpler model if two models perform similarly.
What does MSE stand for?
Mean Squared Error, measures average square distance between correct and predicted outputs.
Is 85% accuracy good?
Accuracy is relative; depends on baseline and upper bound performance.
What is the Baseline in performance evaluation?
The lower bound for performance, often chance/random performance.
What is the Upper bound in performance evaluation?
The best case, often compared to human performance.
What is K-Nearest Neighbours?
A lazy learner that stores data until a request is made.
What are Decision Trees in ML?
Eager learners that process all data upfront and discard it after analysis.
What does the Nearest Neighbour Classifier do?
Classifies a test instance to the class label of the nearest training instance.
What does k-NN stand for?
k-nearest neighbours
What type of model is k-NN?
Non-parametric model
What is a major problem with k-NN?
Sensitive to noise
What is the solution to overfitting in k-NN?
Use k > 1 nearest neighbours and take a majority vote among them.
What does increasing k do to the classifier?
Makes the decision boundary smoother and less sensitive to training data
How should k be chosen in k-NN?
Using a validation dataset
What are some distance metrics used in k-NN?
Euclidean distance, Manhattan distance, Mahalanobis distance, Hamming distance.
What does distance-weighted k-NN do?
Assigns weights to neighbours based on their distance
What happens if k=N in weighted k-NN?
It becomes a global method
What is a disadvantage of k-NN for large datasets?
It can be slow
What is the curse of dimensionality in k-NN?
Distance metrics may not work well in high dimensional spaces
How does k-NN perform regression?
Computes the mean value across k nearest neighbours
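The k-NN behaviour described in these cards — majority vote for classification, mean over neighbours for regression — can be sketched as follows. The data and Euclidean distance metric are illustrative choices, not from the course:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, regression=False):
    """Minimal k-NN: find the k closest training points by Euclidean
    distance, then majority-vote (classification) or average (regression)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    labels = y_train[nearest]
    if regression:
        return labels.mean()          # mean of the k neighbours' values
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]  # majority vote

# Two well-separated clusters of 1-D points (made-up data):
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([1.5]), k=3)
# pred → 0 (all three nearest neighbours belong to class 0)
```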
What is the principle of decision trees?
Focus on a specific subset or feature to make decisions
What type of learners are decision trees?
Eager learners
What is decision tree learning?
A method for approximating discrete classification functions using a tree-based representation.
How can decision trees be represented?
As a set of if-then rules.
What type of search do decision tree learning algorithms use?
Top-down greedy search through the space of possible solutions.
Name some algorithms for constructing decision trees.
ID3, C4.5, CART.
What is the first step in the general decision tree algorithm?
Search for the optimal splitting rule on training data.
What is the goal of finding an optimal split rule?
To create partitioned datasets that are more 'pure' than the original dataset.
What does Information Gain measure?
The reduction of information entropy.
What does Gini Impurity measure?
The probability of incorrectly classifying a randomly picked point according to class label distribution.
What is Variance Reduction mainly used for?
Regression trees where the target variable is continuous.
Who introduced the concept of entropy in information theory?
Claude Shannon (1916-2001).
What does entropy measure?
The uncertainty of a random variable.
What is the formula for the amount of information required to determine the state of a random variable?
I(x) = log2(K).
How is the amount of information related to probability?
I(x) = -log2(P(x)).
What happens to information required when the impostor is more likely in one box?
Low entropy; less new information is gained.
What is the information required when the impostor is equally likely in 4 boxes?
I(x) = -log2(1/4) = 2 bits.
What does low entropy indicate?
You don’t need to know a lot of information to predict the value of a random variable.
What does high entropy indicate?
A lot of new information is gained when predicting the value of a random variable.
What is the entropy of box 1?
0.0439 bits (LOW entropy)
What is the entropy of box 2?
6.6439 bits (HIGH entropy)
How is entropy defined?
Average amount of information: H(X) = −∑_k P(x_k) log₂(P(x_k))
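The entropy formula can be checked with a small helper; the probability distributions below are illustrative:

```python
import math

def entropy(probs):
    """H(X) = -sum_k P(x_k) * log2(P(x_k)); terms with P = 0 contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A fair coin has maximal entropy for two outcomes:
print(entropy([0.5, 0.5]))   # 1.0 bit
# A certain outcome carries no new information:
print(entropy([1.0, 0.0]))   # 0.0 bits
# Four equally likely boxes need 2 bits, matching I(x) = -log2(1/4):
print(entropy([0.25] * 4))   # 2.0 bits
```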
What is the continuous entropy formula?
H(X) = −∫ f(x) log₂(f(x)) dx
What does a 50:50 split of information represent?
Average entropy of 1 (more random outcome)
What is information gain?
Difference between initial entropy and weighted average entropy of subsets.
What is the formula for information gain?
IG(dataset, subsets) = H(dataset) − ∑_{S ∈ subsets} (|S| / |dataset|) · H(S)
What is the binary tree information gain formula?
IG(dataset) = H(dataset) − (|S_left| / |dataset| · H(S_left) + |S_right| / |dataset| · H(S_right))
What are ordered values in decision trees?
Attribute and split point (e.g., weight < 60)
What are categorical values in decision trees?
Search for the most informative feature, create branches for each value.
What is the first step in using ID3 algorithm?
Find the entropy of the initial dataset.
What is the entropy of the dataset D with 9 positive and 5 negative outcomes?
H(D) = 0.940
What is the entropy for 'sunny' outcomes?
H(D_sunny) = 0.971
What is the entropy for 'overcast' outcomes?
H(D_overcast) = 0
What is the entropy for 'rain' outcomes?
H(D_rain) = 0.971
What is the formula for information gain for 'outlook'?
IG(D, outlook) = H(D) − (5/14 · H(D_sunny) + 4/14 · H(D_overcast) + 5/14 · H(D_rain))
What is the total number of days in the dataset?
14 days
What is the information gain for 'outlook'?
0.246
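The worked ID3 example above can be reproduced numerically. The 2+/3− (sunny) and 3+/2− (rain) counts are inferred from the quoted entropies of 0.971 rather than stated in the cards:

```python
import math

def entropy(pos, neg):
    """Binary entropy of a subset with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

# Dataset D: 9 positive / 5 negative days in total.
h_dataset = entropy(9, 5)                    # ≈ 0.940
# Subsets after splitting on 'outlook':
h_sunny    = entropy(2, 3)                   # ≈ 0.971
h_overcast = entropy(4, 0)                   # 0 (all positive)
h_rain     = entropy(3, 2)                   # ≈ 0.971
ig_outlook = h_dataset - (5/14 * h_sunny + 4/14 * h_overcast + 5/14 * h_rain)
print(round(ig_outlook, 3))   # 0.247 (reported as 0.246 when the rounded entropies are used)
```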
What happens to the 'overcast' subset?
It is labeled as a tick since all outcomes are positive (1).
What is a common issue with decision trees?
They can overfit the data.
What is one method to deal with overfitting in decision trees?
Stopping early or pruning the tree.
What is the validation set size in cross-validation?
20% of the provided data.
What is the first step in pruning a decision tree?
Go through each internal node connected only to leaf nodes.
What does a random forest consist of?
A collection of decision trees trained on different subsets of data.
What is the outcome of the algorithm in a random forest?
The majority vote by all the different trees.
What do regression trees predict?
A real-valued number instead of a class label.
What is used instead of information gain for regression trees?
Variance reduction.
How do you make predictions with regression trees?
By taking an average or weighted average of samples in the leaves.
What is the purpose of taking an average in machine learning predictions?
To make predictions based on the distance of different samples in the leaves of the tree.
What is the ultimate goal when creating machine learning systems?
To develop models that generalise to previously unseen examples.
What is a held-out test dataset used for?
To measure the performance of a model on unknown data.
Why is shuffling important before splitting a dataset?
To avoid implicit ordering in the dataset that can bias results.
What are hyperparameters in machine learning?
Model parameters chosen before training, such as 'k' in k-NN.
What is the motivation behind hyperparameter tuning?
To choose hyperparameter values that give the best performance.
What is a disadvantage of testing hyperparameters on the training dataset?
It usually does not generalise well to unseen examples.
What should never be done when evaluating hyperparameters?
Using the test dataset to select hyperparameters based on accuracy.
What is the correct approach for dataset splitting in machine learning?
Split into training, validation, and test sets, e.g., 60:20:20.
What is the purpose of the validation set?
To select the best hyperparameters based on accuracy.
What is hyperparameter tuning/optimisation?
Selecting parameters that produce the best classifier performance.
What can be done for final evaluation after hyperparameter tuning?
Optionally include the validation set back into the training set.
What can be included in the training set for final evaluation?
Validation set can be included to retrain the model on the whole dataset after finding best hyperparameters.
What is the purpose of including the validation set in training?
It provides more data for training, potentially increasing model performance.
When is the final evaluation done?
The final evaluation is done on the test dataset.
What is a risk of developing and evaluating a model on the same data?
It results in overfitting the model to the training data.
What should the test set be used for?
The test set should only be used for estimating performance on unknown examples.
What is cross-validation used for?
Cross-validation is used when the dataset is small to ensure effective testing.
What are the steps in cross-validation?
Split the data into k folds; train on k−1 folds and test on the held-out fold; repeat k times so each fold is held out once; average the k scores.
What does the global error estimate formula represent?
It averages performance metrics across all k held-out test sets.
What is important about cross-validation in model evaluation?
It evaluates an algorithm rather than a single trained instance of a model.
What is one option for parameter tuning during cross-validation?
Use 1 fold for testing, 1 for validation, and k-2 for training in each iteration.
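The fold mechanics these options build on — each fold held out exactly once, scores averaged — can be sketched as plain k-fold cross-validation. The majority-class scorer is a made-up placeholder for a real train-and-evaluate routine:

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle indices and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def cross_validate(X, y, k, train_and_score):
    """Each fold is held out once; the global error estimate is the
    average score over the k held-out folds."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return np.mean(scores)

# Placeholder scorer: accuracy of always predicting the training majority class.
def majority_baseline(X_tr, y_tr, X_te, y_te):
    majority = np.bincount(y_tr).argmax()
    return np.mean(y_te == majority)

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)
est = cross_validate(X, y, k=5, train_and_score=majority_baseline)
```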
What is an alternative method for parameter tuning in cross-validation?
Cross-validation within cross-validation, separating 1 fold for testing.
How does the second option for parameter tuning help?
It allows for optimal hyperparameters to be found using more data.
What is the advantage of using different hyperparameters on each fold during cross-validation?
It likely leads to the best results for small data sets.
What is a disadvantage of using different hyperparameters on each fold?
It requires more work and experiments than simpler methods and is not practical in all situations due to high computation needs.
What is the advantage of testing on all data when going into production?
You can use all available data to train the model for better performance.
What is a disadvantage of testing on all data?
You cannot estimate the performance of the final trained model anymore; you rely on hyperparameters generalizing.
What are the steps in CASE 1 for plenty of data available?
Split into training, validation, and test sets; tune hyperparameters on the validation set; evaluate once on the test set.
What are the steps in CASE 2 for limited data available?
Use cross-validation, tuning hyperparameters on a validation fold within each iteration.
What does a confusion matrix represent?
It visualizes performance, showing true labels vs. predicted labels, allowing analysis of model performance.
What is accuracy in model evaluation?
Accuracy = (TP + TN) / (TP + TN + FP + FN).
How is classification error calculated?
Classification error = 1 - accuracy.
What is precision in model evaluation?
Precision = TP / (TP + FP). It measures the correctness of positive predictions.
What does high precision indicate?
If a model predicts something as positive, it is likely to be correct.
What is recall in model evaluation?
Recall = TP / (TP + FN). It measures the ability to find all positive examples.
What is the precision for Class 1?
60%
What is the formula for recall?
Recall = \( \frac{TP}{TP + FN} \)
What is the recall for Class 1?
75%
What does high recall indicate?
Good at retrieving positive examples, but may include false positives.
What is the trade-off between precision and recall?
High precision often leads to low recall and vice versa.
What is macro-averaged recall for two classes?
62.5%
What does the F-measure combine?
It combines precision and recall into a single score.
What is the formula for F1 score?
\( F1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \)
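Precision, recall, and F1 can be computed directly from confusion counts. The counts below are illustrative, chosen to reproduce the Class 1 figures quoted above (60% precision, 75% recall):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and
    false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts consistent with the Class 1 figures: TP=3, FP=2, FN=1.
p, r, f1 = precision_recall_f1(tp=3, fp=2, fn=1)
print(p, r)          # 0.6 0.75
print(round(f1, 3))  # 0.667 — the harmonic mean sits between the two
```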
What does a confusion matrix evaluate?
It evaluates performance in multi-class classification.
What is accuracy in classification?
Accuracy = \( \frac{Number \ of \ correctly \ classified \ examples}{Total \ number \ of \ examples} \)
What is the difference between micro-averaging and macro-averaging?
Macro-averaging averages metrics at the class level; micro-averaging at the item level.
What is the effect of micro-averaging on precision, recall, and F1 in binary and multi-class classification?
They equal accuracy.
What is micro-averaged precision, recall, and F1 equal to?
Accuracy
What is the most common evaluation metric for regression tasks?
Mean Squared Error (MSE)
How is MSE calculated?
MSE = \( \frac{1}{N} \sum_{i=1}^{N} (Y_i - \tilde{Y}_i)^2 \)
What does a lower MSE indicate?
Better predictions
What does RMSE stand for?
Root Mean Squared Error
How is RMSE calculated?
RMSE = \( \sqrt{MSE} \)
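Both metrics are a few lines of code; the target and predicted values below are made up:

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: average squared difference between
    correct and predicted outputs."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE, so the error is
    expressed in the same units as the target variable."""
    return math.sqrt(mse(y_true, y_pred))

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mse(y_true, y_pred))   # (1 + 0 + 4) / 3 ≈ 1.667
print(rmse(y_true, y_pred))  # ≈ 1.291
```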
What are the five important model characteristics in ML?
Accurate, Fast, Scalable, Simple, Interpretable
What is a balanced dataset?
Equal number of examples in each class
What is an imbalanced dataset?
Classes are not equally represented
What can affect accuracy in imbalanced datasets?
Performance of the majority class
What does macro-averaged recall help detect?
If one class is completely misclassified
What is a solution for imbalanced test sets?
Normalize counts in the confusion matrix
What does a normalized confusion matrix achieve?
Calculates metrics as if evaluated on a balanced dataset
What is one view of system performance on a balanced test set?
The classifier's performance remains the same.
What should be evaluated for a more realistic scenario?
The system should be evaluated with data having a realistic distribution.
What is one solution to balance classes?
Down-sample the majority class.
What is another solution to balance classes?
Up-sample the minority class.
What does overfitting indicate about model performance?
Good performance on training data, but poor generalization to other data.
What does underfitting indicate about model performance?
Poor performance on both training and test data.
What happens to classification error as models learn?
Classification error decreases for training but may increase for test data.
What can cause overfitting?
A model that is too complex or training data that is not representative.
How can we fight overfitting?
Choose optimal hyperparameters and use regularization.
What is a confidence interval?
A way to quantify confidence in an evaluation result.
What affects confidence in an evaluation result?
The size of the test set.
What is the impact of a small test set on accuracy?
90% accuracy measured on only 10 samples is far less trustworthy than the same accuracy measured on a large test set.
What affects confidence in evaluation results?
The size of the test set affects confidence in evaluation results.
What is true error?
True error is the probability that the model misclassifies a randomly drawn example from a distribution.
How is true error mathematically defined?
True error is defined as: error_D(h) ≡ Pr_{x∼D}[f(x) ≠ h(x)].
What is sample error?
Sample error is the classification error based on a sample from the underlying distribution.
How is sample error mathematically defined?
Sample error is defined as: error_S(h) ≡ (1/N) ∑_{x∈S} δ(f(x), h(x)).
What does 𝛿(𝑓(𝑥), ℎ(𝑥)) represent?
δ(f(x), h(x)) = 1 if f(x) ≠ h(x), and 0 if f(x) = h(x).
What is a confidence interval?
An N% confidence interval is an interval that is expected with probability N% to contain the parameter q.
What does a 95% confidence interval [0.2, 0.4] mean?
It means that with probability 95%, the true parameter q lies between 0.2 and 0.4.
How does sample size affect confidence intervals?
As sample size n increases, the interval narrows around the sample estimate.
What is the example confidence interval for errorS(h) = 0.22 with n = 50?
With n = 50 and z_N = 1.96, the 95% confidence interval is 0.22 ± 0.11 — a spread of over 20 percentage points.
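The interval quoted above follows from the standard normal-approximation formula error_S(h) ± z_N · sqrt(error · (1 − error) / n), sketched here:

```python
import math

def error_confidence_interval(error, n, z=1.96):
    """Approximate confidence interval for the true error, given a sample
    error measured on n examples (normal approximation; z=1.96 gives 95%)."""
    margin = z * math.sqrt(error * (1 - error) / n)
    return error - margin, error + margin

low, high = error_confidence_interval(0.22, n=50)
print(round(low, 2), round(high, 2))   # 0.11 0.33 — a very wide interval
low2, high2 = error_confidence_interval(0.22, n=5000)
# With 100x more test examples the interval shrinks by a factor of 10.
```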
What does statistical significance testing help determine?
Statistical significance testing helps determine if there is a difference between two distributions of classification errors.
What does a graph with overlapping distributions indicate?
Overlapping distributions indicate uncertainty about which classifier is better due to sampling error.
What is the Marek ApprovedTM test?
The Marek ApprovedTM test is the Randomisation test, considered intuitive for comparing algorithms.
What do statistical tests determine?
Statistical tests tell us if the means of two sets are significantly different.
Name three statistical tests mentioned.
Randomisation, T-test, Wilcoxon rank-sum.
How does the Randomisation test work?
It randomly switches predictions between two systems and measures if the performance difference is greater or equal to the original difference.
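A minimal version of the randomisation test might look like this. The per-example 0/1 error outcomes are made up, and `n_rounds` approximates the full permutation distribution by random sampling:

```python
import random

def randomisation_test(errors_a, errors_b, n_rounds=10000, seed=0):
    """Approximate randomisation (permutation) test: randomly swap the two
    systems' per-example outcomes and count how often the shuffled
    difference in mean error is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(errors_a) / len(errors_a) - sum(errors_b) / len(errors_b))
    count = 0
    for _ in range(n_rounds):
        a, b = [], []
        for ea, eb in zip(errors_a, errors_b):
            if rng.random() < 0.5:
                ea, eb = eb, ea      # swap this example's outcomes
            a.append(ea)
            b.append(eb)
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed:
            count += 1
    return count / n_rounds          # estimated p-value

# 0 = correct, 1 = error on each test example (illustrative outcomes):
sys_a = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
sys_b = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
p_value = randomisation_test(sys_a, sys_b)
# A p-value below 0.05 would let us reject the null hypothesis that
# the two systems perform the same.
```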
What does a small p-value indicate?
A small p-value means we can be more confident that one system is different from the other.
What is the null hypothesis?
The null hypothesis states that the two algorithms/models perform the same and differences are due to sampling error.
What is the significance level for performance difference?
Performance difference is statistically significant if p < 0.05 (5%).
What is P-hacking?
P-hacking is the misuse of data analysis to find patterns that appear statistically significant without an underlying effect.
What happens if the number of experiments increases in P-hacking?
Increasing experiments can lead to a higher false discovery proportion, even if true discoveries remain the same.
What is the false positive rate in the example of P-hacking?
P(false positive) = 0.05, the same as the significance level.
What is the false discovery proportion in the initial example?
The false discovery proportion is 35 / 115 = 30%.
What happens to the false discovery proportion when experiments increase to 2400?
The false discovery proportion increases to 115 / 195 = 59%.
How many true discoveries were made?
80 true discoveries
How many false discoveries were made?
115 false discoveries
What is the false discovery proportion?
59%
What is the sample size of the 'study'?
54 people
How many possible relations were searched in the 'study'?
27,716 possible relations
What is a method to defend against unintentional p-hacking?
Adaptive threshold for calculating p-value (Benjamini & Hochberg, 1995)
What is the first step in the Benjamini-Hochberg method?
Rank the p-values from the M experiments
What does the Benjamini-Hochberg critical value formula represent?
The new significance threshold for the i-th ranked p-value: (i / M) · α.
What is the original significance threshold in the Benjamini-Hochberg method?
5%
What is the downside of the Benjamini-Hochberg method?
Thresholds for most experiments will be lower than the original 5%
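The Benjamini-Hochberg procedure described above can be sketched as follows; the p-values are illustrative:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg: rank the M p-values, compare the i-th smallest
    against the critical value (i / M) * alpha, and accept every hypothesis
    up to the largest rank that passes its threshold."""
    m = len(p_values)
    indexed = sorted(enumerate(p_values), key=lambda t: t[1])
    cutoff_rank = 0
    for rank, (_, p) in enumerate(indexed, start=1):
        if p <= (rank / m) * alpha:
            cutoff_rank = rank
    significant = {idx for idx, _ in indexed[:cutoff_rank]}
    return [i in significant for i in range(m)]

# A naive 0.05 threshold would accept five of these; BH accepts only two.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p_values))
# [True, True, False, False, False, False, False, False]
```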
What are Artificial Neural Networks (ANNs)?
A class of ML algorithms optimized with gradient descent
What does Deep Learning refer to?
Using neural network models with multiple hidden layers
Why has deep learning become more popular now?
Better conditions for implementation, like big data and faster hardware
What are perceptrons?
An early version of neural networks proposed in 1958 by Rosenblatt
What is backpropagation?
Described in 1974 by Werbos, it is a training algorithm for neural networks
What are LSTMs and CNNs?
Key components of modern neural network architectures described in the late '90s
What is a benefit of having large datasets for neural networks?
They improve training efficiency and effectiveness
What advancements have improved neural network training?
Better CPUs and GPUs for efficient computation
What operations can be efficiently parallelized on graphics cards?
Matrix operations
What has improved the accessibility and affordability of graphics cards?
Increased efficiency and reduced cost
What are automatic differentiation libraries used for?
They handle backpropagation and optimisation of model parameters.
What is linear regression useful for in machine learning?
It serves as a stepping stone towards neural network models
What type of learning is linear regression?
Supervised learning
What does the dataset in supervised learning consist of?
Input and output pairs
What is the goal of supervised learning?
Learn the mapping f: X → Y
What does the function f represent in linear regression?
The mapping from inputs to outputs
What are the desired labels in classification problems?
Discrete labels
What are the desired labels in regression problems?
Continuous labels
What controls the gradient of a straight line in linear regression?
The parameter 'a'
What does the parameter 'b' represent in linear regression?
The y-intercept
What does the loss function measure in linear regression?
How well we are performing on our dataset
What is the formula for the loss function in linear regression?
E = (1/2N) · Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²
What does a smaller value of E indicate?
Predictions are close to real values
What do derivatives show in the context of linear regression?
How to change each parameter value to reduce loss
What is the purpose of gradient descent?
To repeatedly update parameters a and b
What does the learning rate (α) control in gradient descent?
The step size for updating parameters
What is the learning rate in gradient descent?
The learning rate, denoted as 𝛼, is a hyperparameter that determines the size of the steps taken towards the minimum of the loss function.
What does 𝜕𝐸/𝜕𝑎 represent?
It represents the partial derivative of the loss function with respect to parameter 𝑎.
What is the formula for updating parameter 𝑎?
The update rule is: a_new := a_old − (α/N) ∑ᵢ (a·x⁽ⁱ⁾ + b − y⁽ⁱ⁾)·x⁽ⁱ⁾, where N is the total number of data points.
What does an epoch represent in machine learning?
An epoch is one complete pass over the entire dataset during training.
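The full gradient-descent loop for simple linear regression, using a 1/(2N)-scaled squared-error loss and per-epoch parameter updates, might be sketched as follows; the learning rate, epoch count, and data are illustrative:

```python
def gradient_descent(xs, ys, alpha=0.05, epochs=1000):
    """Fit y ≈ a*x + b by gradient descent on
    E = (1/2N) * sum((a*x_i + b - y_i)^2)."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):            # one epoch = one pass over the data
        residuals = [a * x + b - y for x, y in zip(xs, ys)]
        grad_a = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        a -= alpha * grad_a            # step against the gradient
        b -= alpha * grad_b
    return a, b

# Noise-free data generated by y = 2x + 1:
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = gradient_descent(xs, ys)
print(round(a, 2), round(b, 2))   # 2.0 1.0
```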
What is the gradient in vector notation?
The gradient is a vector of all partial derivatives for a function with K parameters: ∇_θ f(θ) = [∂f(θ)/∂θ₁, ∂f(θ)/∂θ₂, …, ∂f(θ)/∂θ_K].
What is the analytical solution for linear regression?
The analytical solution allows finding optimal parameters without iterating through epochs by solving a specific equation.
What is the complexity of matrix inversion?
Matrix inversion has cubic complexity, making it computationally expensive for large problems.
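The analytical solution can be sketched with NumPy on data generated by y = 2x + 1. `np.linalg.inv` is used here to mirror the formula, even though inversion is cubic and `np.linalg.solve` would be preferred in practice:

```python
import numpy as np

# Data generated by y = 2x + 1 (no noise):
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Append a column of ones so the bias b is learned as an extra weight.
X_aug = np.hstack([X, np.ones((len(X), 1))])

# Normal equation: theta = (X^T X)^{-1} X^T y.
theta = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y
print(theta)   # ≈ [2.0, 1.0]  (slope a, intercept b)
```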
What is multiple linear regression?
Multiple linear regression uses multiple input features, each with its own parameter, to predict an output value.
How does the RMSE change with multiple features?
The RMSE (Root Mean Square Error) is typically lower with multiple features due to increased information for prediction.
What is RMSE in model evaluation?
Root Mean Square Error (RMSE) measures the differences between predicted and observed values; lower RMSE indicates better model accuracy.
How does using more features affect model predictions?
Using more features provides more information, leading to more accurate predictions in the model.
What does a linear regression model represent in higher dimensions?
In higher dimensions, the linear regression model is a continuous linear plane representing the learned data.
What is the role of the nucleus in a biological neuron?
The nucleus acts like the neuron's brain, telling it what to do.
What do dendrites do in a biological neuron?
Dendrites connect to other neurons and receive signals from them.
What happens when a biological neuron's axon fires?
When conditions are right, the axon fires a signal to connect with other neurons' dendrites.
What are input features in an artificial neuron?
Input features (xi) are the values fed into the artificial neuron, each with an associated weight (θi).
What determines the importance of a feature in an artificial neuron?
The weight (θi) associated with each input feature determines its importance in the artificial neuron.
What does the output of an artificial neuron involve?
The output involves multiplying features and weights, and adding the bias (b).
What is the activation function in an artificial neuron?
The activation function (g) transforms the output of the linear equation into a new value.
How can the bias term be included in the equation?
The bias term can be included by reformulating the equation to add an extra feature and weight for the bias.
What is the vector notation for input features and weights?
Input features and weights can be represented as vectors: x = [x1, x2, ..., xK], W = [θ1, θ2, ..., θK].
What is the logistic activation function used for?
The logistic function (sigmoid) squashes any value into a range between 0 and 1.
What does logistic regression actually do?
Logistic regression performs binary classification using the logistic function, not actual regression.
How is the logistic regression model optimized?
The logistic regression model is optimized using gradient descent.
What is a perceptron?
A perceptron is an algorithm for supervised binary classification, an early version of an artificial neuron.
What activation function does a perceptron use?
A perceptron uses a threshold function as its activation function, outputting 0 until a certain limit is reached.
How does the perceptron's threshold activation function behave?
A threshold function that outputs 0 until a limit (θ) is reached, then outputs 1.
What is the output of the activation function in the perceptron?
1 if Wᵀx > 0, otherwise 0.
What is the perceptron learning rule update formula?
θ_i ← θ_i + α(y − h(x)) x_i
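The update rule above can be sketched in a few lines of numpy; this is an illustrative implementation (function names and the OR example are my own, not code from the course):

```python
import numpy as np

def perceptron_train(X, y, alpha=0.1, epochs=20):
    """Perceptron rule: theta_i <- theta_i + alpha * (y - h(x)) * x_i.
    A constant feature 1 is appended so the bias is learned as a weight."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    theta = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            h = 1 if theta @ xi > 0 else 0   # threshold activation
            theta += alpha * (yi - h) * xi   # no update when h(x) == y
    return theta

def perceptron_predict(X, theta):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ theta > 0).astype(int)

# Logical OR is linearly separable, so the perceptron can learn it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_or = np.array([0, 1, 1, 1])
theta = perceptron_train(X, y_or)
```

Running the same loop on XOR labels would never converge, since no single hyperplane separates the classes.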
What happens when y = 1 and h(x) = 0?
Weight θi is increased if xi is positive, decreased if negative.
What happens when y = 0 and h(x) = 1?
We want to reduce Wᵀx, so the update is reversed: θ_i is decreased if x_i is positive, increased if negative.
What types of functions can a perceptron learn?
Any linearly separable function, like logical OR.
Why can't a perceptron learn XOR?
XOR is not linearly separable; no single straight line (hyperplane) can separate the classes.
What is a weakness of using a single neuron?
It cannot classify complex relationships like XOR.
What is needed to model complex relationships in data?
Multi-layer neural networks are required.
What is a multi-layer perceptron (MLP)?
A network that connects neurons in sequence to learn higher order features.
What is the role of hidden layers in a neural network?
They process features and are not visible from the outside.
What does each block in a block diagram represent?
A layer of the model with multiple neurons.
What is the first and last layer of a neural network called?
The first layer is the input layer and the last is the output layer.
What should you check when something isn’t working in a neural network?
Ensure that the matrix dimensions match.
What is b in the context of a neural network layer?
The layer-specific bias vector, unique to each neuron in a layer.
How many neurons are typically in deep neural networks?
Thousands or millions of neurons.
What can multi-layer neural networks learn?
Useful representations and features.
What was the approach to feature crafting before multi-layer networks?
Manually crafting features for pattern recognition.
What is end-to-end learning?
Allowing the network to learn features from raw input.
What do lower levels of a neural network act as?
Feature extractors.
What do higher levels of a neural network learn?
The higher levels act as the classification layer.
What is the benefit of training both feature extraction and classification layers together?
They optimize each other based on data.
What should you use if the data is linearly separable?
A linear function for the model.
What happens if we only use linear activation functions in a multi-layer network?
It becomes equivalent to a single-layer network.
What is the simplest activation function?
Linear activation (identity function).
What does the output of a neuron with linear activation become?
ŷ = f(Wᵀx) = Wᵀx.
What is the equation for output in a two-layer network?
ŷ = W1(W2x) = Ux, where U = W1W2.
What happens when a two-layer network uses linear activation?
It collapses into a single-layer network, unable to capture complex non-linear patterns.
What do non-linear activation functions do?
They allow models to learn complicated patterns by breaking the dependency of multiple layers collapsing into one.
What is the range of the sigmoid activation function?
The sigmoid function compresses output into the range between 0 and 1.
What is the formula for the sigmoid activation function?
f(x) = σ(x) = 1 / (1 + e^(-x))
What is the range of the tanh activation function?
The tanh function maps input values to the range -1 to 1.
What is the formula for the tanh activation function?
f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
What characterizes the ReLU activation function?
ReLU is linear and unbounded in the positive part, but non-linear overall.
What is the formula for the ReLU activation function?
f(x) = ReLU(x) = { 0 for x ≤ 0; x for x > 0 }
What does the softmax activation function do?
It scales inputs into a probability distribution that sums to 1.
What is the formula for the softmax activation function?
softmax(zi) = e^(zi) / ∑ e^(zk)
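The four activation functions above can be written directly in numpy; a small sketch (the max-shift inside softmax is a standard numerical-stability addition, not something from the cards):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes into (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # 0 for x <= 0, x for x > 0

def softmax(z):
    e = np.exp(z - np.max(z))         # shift by max for numerical stability
    return e / e.sum()                # non-negative, sums to 1

p = softmax(np.array([1.0, 2.0, 3.0]))
```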
What is a common activation function for deep neural networks?
ReLU is commonly used in very deep neural networks, especially for image recognition.
Which activation functions are more robust than ReLU?
Tanh and sigmoid are more robust than ReLU.
What is a potential issue with using ReLU?
ReLU can produce unbounded values, which can destabilise training.
What should you try first when designing models?
Experiment with tanh and sigmoid first, as they are bounded.
How should the choice of activation function in hidden layers be treated?
It is a hyperparameter that can be set empirically or optimized using a development set.
How can we set hyperparameters for activation functions?
Empirically or using a development set to find the best performing function for the model and dataset.
What determines the choice of activation function in the output layer?
It depends on the task.
What activation function is commonly used for binary classification?
Sigmoid is most common; tanh can also be used.
What activation function should be used for predicting unbounded scores?
Use a linear activation function.
What activation function is most commonly used for predicting a probability distribution?
Softmax is used for multi-class classification.
What does Softmax do?
It scales values into a probability distribution, making them sum to 1.
What is the input dimension for the neural network in PyTorch?
The input dimension is 10.
How many neurons are in the hidden layer of the PyTorch neural network?
There are 5 neurons in the hidden layer.
What is the output dimension of the PyTorch neural network?
The output dimension is 1.
What activation function is applied in the hidden layer during the forward pass?
Tanh is used as the activation function.
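The cards describe a PyTorch network with input dimension 10, a hidden layer of 5 neurons, output dimension 1, and tanh in the hidden layer. The original code is not reproduced here, so this is a numpy sketch of the same forward pass (the weight initialisation is my own assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the cards: input 10, hidden 5, output 1
W1, b1 = rng.normal(0, 0.1, size=(10, 5)), np.zeros(5)
W2, b2 = rng.normal(0, 0.1, size=(5, 1)), np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)   # hidden layer: tanh activation
    return h @ W2 + b2         # output layer: linear

x = rng.normal(size=(3, 10))   # a batch of 3 input vectors
y_hat = forward(x)             # shape (3, 1): one output per input
```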
What is the purpose of the loss function in neural networks?
To minimize and show performance on a specific task.
How do we update parameters in neural networks?
Using gradient descent to minimize the loss function.
What is the formula for updating parameters in gradient descent?
\( \theta_i^{(t+1)} = \theta_i^{(t)} - \alpha \frac{\partial E}{\partial \theta_i^{(t)}} \)
What type of task is a regression task?
Predicting a continuous variable, like velocity or price.
What is the goal of a regression task?
To predict a continuous variable.
What is an example of a regression task?
Predicting the price of a house.
What activation function is often used in the output layer for regression?
Linear activation.
What loss function is commonly used in regression?
Mean Squared Error (MSE).
What is the formula for Mean Squared Error (MSE)?
MSE = \frac{1}{N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2
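The MSE formula above as a one-line numpy sketch:

```python
import numpy as np

def mse(y_hat, y):
    # MSE = (1/N) * sum((y_hat_i - y_i)^2)
    return np.mean((y_hat - y) ** 2)

y = np.array([3.0, 5.0, 2.0])
perfect = mse(y, y)   # 0 when every prediction is correct
```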
What does MSE equal when predictions are correct?
0.
What is the primary goal of classification tasks?
To choose between different categories or discrete options.
What is binary classification?
Classification with only 2 possible classes.
What is multi-class classification?
Classification with more than 2 classes, where each input belongs to exactly 1 class.
What is multi-label classification?
Each input can belong to multiple classes.
What is the loss function used in classification?
Cross-entropy.
What do we want to maximize in classification?
The likelihood of the network assigning correct labels.
What is the probability formulation for binary classification?
\prod_{i=1}^{N} (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{(1-y^{(i)})}
What happens if the network assigns the correct label for every data point?
The product approaches 1.
What is the issue with multiplying probabilities in classification?
It can lead to underflow errors.
How can we avoid underflow errors in classification?
By maximizing the logarithm of the probability formula.
What is the formula for binary cross-entropy loss?
-\sum_{i=1}^{N} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]
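Binary cross-entropy as a numpy sketch; the eps-clipping is my own addition to avoid log(0), and the mean normalises by N so the loss magnitude is independent of the number of data points:

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # clip predictions away from 0 and 1 so the logs stay finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```

A maximally uncertain prediction of 0.5 for a positive example costs log 2 ≈ 0.693.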
What is the formula for the binary cross-entropy loss normalised by N?
\( L = -\frac{1}{N} \sum_{i=1}^{N} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})] \)
What does normalizing by the number of data points do in loss calculation?
It makes the loss magnitude independent of the number of data points.
What is categorical cross-entropy?
It generalizes binary cross-entropy for multiple classes.
What is the formula for categorical cross-entropy loss?
\( L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_c(i) \log(\hat{y}_c(i)) \)
In categorical cross-entropy, what does y_c represent?
1 if C is the correct class for data point i, 0 otherwise.
What is the output layer configuration for a multi-class classification neural network example?
An output layer with 3 neurons predicting probabilities over 3 flower types.
What activation function is commonly used with categorical cross-entropy loss?
Softmax activation.
What is batching in neural networks?
Combining vectors of several data points into a matrix for simultaneous processing.
Why is batching beneficial for training neural networks?
It increases speed and reduces noise, leveraging GPU efficiency.
What does batching allow GPUs to do more efficiently?
Perform matrix multiplications in parallel.
How does batching assist in regularization during optimization?
It combines updates from several data points easily.
What is the benefit of batching in neural networks?
Combines updates from several datapoints, making updates more stable and accurate.
What is the input matrix X in a neural network?
A batch of data points with dimensions n x k, where n is the number of data points and k is the number of features.
What does the first layer in a neural network apply?
A linear transformation using a weight matrix and adding a bias.
What is Z in the context of a neural network?
The output matrix after applying the weight matrix and bias in the first layer.
What do we get after applying the activation function to Z?
A, the output of the first hidden layer.
What is the purpose of calculating loss in a neural network?
To determine how well the model performs.
What method is used to update model parameters in neural networks?
Gradient descent is used to update weight matrices and biases.
What is backpropagation in neural networks?
A method to calculate necessary partial derivatives iteratively.
How does backpropagation simplify calculations?
It breaks down calculations into smaller steps, moving backwards through the network.
What is the chain rule used for in neural networks?
To calculate the derivative of a composite function.
What does the chain rule formula represent?
It shows how to break down derivatives into smaller parts for easier calculation.
How can we find the partial derivative of the loss with respect to W[1]?
By breaking it down through Z[1] and A[1] using their respective derivatives.
What are the two types of partial derivatives in backpropagation?
The output of an activation function w.r.t its input and the output of a linear transformation w.r.t its input.
What is the purpose of the partial derivative in backpropagation?
To update the weights of the linear transformation in the neural network.
What does the partial derivative of a matrix w.r.t another matrix represent?
A 4-D tensor containing the partial derivatives of every element in the first matrix w.r.t every element in the second.
What is the linear transformation notation used in backpropagation?
Z = XW, where Z is the output, X is input, and W is weights.
What do you need to calculate to update weights in a linear transformation?
The partial derivative of the loss w.r.t the weights and the bias vector.
What is the shape of the partial derivative of a scalar w.r.t a matrix?
It has the same shape as the original matrix itself.
What is the key component in the derivatives during backpropagation?
The partial derivative of the loss w.r.t the output of the linear transformation.
What does backpropagation iteratively calculate?
Partial derivatives, taking them from the top layers and passing them down.
What is the bias vector used for in backpropagation?
It is repeated for each neuron in the layer to add the same bias to each input vector.
What is necessary for lower levels to calculate their own partial derivatives?
The gradient of the loss w.r.t the input and the weight's partial derivative.
What rule is used to break down the calculations in backpropagation?
The chain rule.
What is the significance of the dimensions N, D, and M in backpropagation?
N is the number of data points in the batch, D the input dimension, and M the output dimension.
What happens during the forward pass in a neural network?
The operation takes X and W as inputs and produces output Z.
What does the partial derivative of the loss w.r.t one element depend on?
It depends on the weights it multiplies with and the loss of whatever uses this element.
How many output values does the particular element affect?
It affects exactly 3 output values: z1,1, z1,2, and z1,3.
What is the equation for the partial derivative of the element?
The equation uses the chain rule and involves the weight w1,1 and the partial derivative of z1,1 w.r.t x1,1.
What happens when you calculate the partial derivative w.r.t the full matrix X?
It can be expressed as a dot product of two matrices.
What do the two matrices in the dot product represent?
The first is the partial derivative of the loss w.r.t Z, and the second is the transposed weight matrix for the layer.
What is the importance of backpropagation for inputs X?
It is a simple way of calculating backpropagation for inputs in a given layer.
How do we calculate the partial derivative w.r.t the weights?
By breaking it down for one individual weight, considering its effect on the output.
What does one weight affect in the output?
One weight affects two values in the output for two data points in the batch.
What is the equation for the partial derivative of the loss w.r.t the weights?
It is a dot product of the partial derivative of the loss w.r.t Z and the transposed matrix of features XT.
What do we need to calculate for the bias vector?
The partial derivative of the loss w.r.t the bias vector.
What result do we get for the partial derivative of the loss w.r.t the bias?
It is equal to a transposed column vector of 1s times the partial derivative of the loss w.r.t z.
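The three derivatives above — w.r.t the inputs, the weights, and the bias — can be checked numerically. A sketch for Z = XW + b with the illustrative loss L = ½∑Z², chosen so that ∂L/∂Z = Z:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))  # batch of 4 inputs with 3 features
W = rng.normal(size=(3, 2))  # weights for a 3 -> 2 layer
b = rng.normal(size=(1, 2))  # bias, broadcast over the batch

def loss(X, W, b):
    Z = X @ W + b                    # forward pass
    return 0.5 * np.sum(Z ** 2)      # illustrative loss with dL/dZ = Z

Z = X @ W + b
dZ = Z                               # partial derivative of loss w.r.t Z
dX = dZ @ W.T                        # gradient passed down to the inputs
dW = X.T @ dZ                        # gradient for the weight update
db = np.ones((1, X.shape[0])) @ dZ   # row of 1s times dL/dZ

# Numerical check of one weight entry via central differences
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(X, Wp, b) - loss(X, Wm, b)) / (2 * eps)
```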
What is needed to perform full backpropagation through the neural network?
How to handle the activation functions.
How are activation functions generally applied?
They are applied element-wise.
What is the purpose of activation functions in a neural network?
Activation functions are applied element-wise to introduce non-linearity, allowing the network to learn complex patterns.
Do activation functions have parameters that need updating during training?
No, activation functions generally do not have parameters that need to be updated during training.
What is the derivative of an activation function denoted as?
The derivative of an activation function is denoted as g′(x).
What does the chain rule help with in back propagation?
The chain rule helps calculate the partial derivative of the loss with respect to the inputs of the activation function.
What is the derivative of the Linear activation function?
For Linear: g(z) = z, g′(z) = 1.
What is the formula for the Sigmoid activation function?
For Sigmoid: g(z) = 1/(1 + e^(-z)), g′(z) = g(z)(1 - g(z)).
What is the formula for the Tanh activation function?
For Tanh: g(z) = (e^z - e^(-z))/(e^z + e^(-z)), g′(z) = 1 - g(z)².
What is the ReLU activation function and its derivative?
For ReLU: g(z) = z for z > 0, 0 for z ≤ 0; g′(z) = 1 for z > 0, 0 for z ≤ 0.
How is Softmax different from other activation functions?
Softmax takes a whole vector as input and outputs a whole vector, unlike other activation functions applied element-wise.
What is the purpose of combining Softmax with cross-entropy?
Combining Softmax with cross-entropy simplifies the backpropagation of derivatives for classification tasks.
What does the joint partial derivative through Softmax and cross-entropy represent?
It represents the predictions minus the true class labels, normalized by N if applicable.
What is gradient descent?
Gradient descent is an optimization algorithm that updates parameters by taking small steps in the negative direction of the gradient.
What is the formula for updating weights in gradient descent?
W_new = W_old - α * (∂L/∂W), where α is the learning rate.
What is the learning rate in gradient descent?
The learning rate (α) is a hyperparameter that determines the step size for updating model parameters.
What does α represent in gradient descent?
Learning rate/step size, a hyperparameter based on the development set.
What must be true for gradients to be computed in neural networks?
Network functions and the loss need to be differentiable.
What is the first step in the general algorithm for gradient descent?
Initialise weights randomly.
What is the termination condition in gradient descent?
When the loss function does not improve anymore.
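The general algorithm — initialise, step in the negative gradient direction, stop when updates stop mattering — in a minimal sketch on a one-dimensional quadratic (the example loss is my own):

```python
def gradient_descent(grad, w0, alpha=0.1, tol=1e-8, max_steps=10000):
    w = w0                       # initialise (randomly, in general)
    for _ in range(max_steps):
        step = alpha * grad(w)   # small step along the negative gradient
        w = w - step
        if abs(step) < tol:      # terminate when the update is negligible
            break
    return w

# L(w) = (w - 3)^2 has dL/dw = 2(w - 3) and its minimum at w = 3
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```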
What is a common issue when updating weights during backpropagation?
Updating some weights before all gradients have been computed with the original weights produces incorrect gradients.
What is Stochastic Gradient Descent (SGD)?
Calculating the gradient based on one data point and updating weights immediately.
What are the steps in Stochastic Gradient Descent?
For each data point: compute the gradient of the loss on that single point, then update the weights immediately; repeat over the whole dataset.
What is Mini-batched Gradient Descent?
A balance between batch and stochastic gradient descent, using batches of data points.
What are the steps in Mini-batched Gradient Descent?
Split the data into small batches; for each batch, compute the gradient of the loss over the batch and update the weights.
What is a challenge in optimising neural networks?
Finding the lowest point on complex loss surfaces is difficult.
Why is the learning rate important?
The size of the learning rate significantly affects the training process.
What happens if the learning rate is too low?
Optimization can take a very long time to reach a good minimum.
What happens if the learning rate is too high?
We can step over the correct solution.
What is the ideal state of the learning rate?
It allows reaching the minimum of the loss function in a reasonable number of steps.
What is the learning rate?
A hyperparameter that needs to be chosen based on the development set.
What are adaptive learning rates?
Different learning rates for each parameter in the model.
What happens if a parameter has not been updated for a while?
The learning rate for that parameter may be increased.
What happens if a parameter is making big updates?
The learning rate for that parameter may be decreased.
What algorithms work well for adaptive learning rates?
The 'Adam' and 'AdaDelta' algorithms.
What is learning rate decay?
Scaling the learning rate by a value between 0 and 1.
What is the intuition behind learning rate decay?
Take smaller steps as we approach the minimum to avoid overshooting.
When can learning rate decay be performed?
Every epoch, after a certain number of epochs, or when validation performance doesn't improve.
What is the simplest approach to weight initialization?
Setting weights to zeros.
Why should we not set all weights to zero?
Neurons will learn the same things, leading to the same optimized values.
What is a common method for weight initialization?
Drawing randomly from a normal distribution with mean 0 and variance 1 or 0.1.
What does Xavier Glorot initialization do?
Draws values from a uniform distribution based on the number of neurons in layers.
What is the formula used in Xavier Glorot initialization?
Weights are drawn from a uniform distribution with bounds ±√6/√(n_in + n_out), where n_in and n_out are the numbers of neurons in the adjacent layers.
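A sketch of Glorot/Xavier uniform initialisation, assuming the standard bound √6/√(n_in + n_out):

```python
import numpy as np

def xavier_uniform(n_in, n_out, seed=0):
    # Glorot/Xavier: bound = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(100, 50)   # weights for a 100 -> 50 layer
```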
What role does randomness play in neural networks?
It is important for various aspects of the learning process.
What role does randomness play in neural networks?
Different random initialisations lead to different results and performance.
What is the solution to controlling randomness in neural networks?
Explicitly set the random seed for all random number generators used.
What can happen when processes are parallelised on GPUs?
They can produce randomly different results due to different threads running at different times.
How should you report model performance under different random seeds?
Report the mean and standard deviation of the performance.
What is min-max normalisation?
Scaling the smallest value to a and the largest to b, e.g., [0, 1] or [-1, 1].
What is the formula for min-max normalisation?
X′ = a + (X - Xmin)(b - a) / (Xmax - Xmin)
What is standardisation (z-normalisation)?
Scaling the data to have mean 0 and standard deviation 1.
What is the formula for standardisation?
X′ = (X - μ) / σ
Why is normalisation important in neural networks?
It helps weight updates to be proportional to the input, improving model learning accuracy.
What should you remember about normalisation for data columns?
Normalise each column separately, not the entire matrix.
How should normalising constants be calculated?
Calculate them based only on the training set and apply them to test/evaluation sets.
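A sketch of the recommended workflow: compute the standardisation constants per column on the training set only, then reuse them for the test set (the toy data is illustrative):

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 25.0]])

# Constants computed per column, on the TRAINING set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)

train_z = (train - mu) / sigma   # mean 0, std 1 per column
test_z = (test - mu) / sigma     # same constants reused for test data
```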
What is gradient checking?
A method to verify if the gradient is calculated correctly in the implementation.
What are the two methods to isolate the gradient?
Using the difference between weights before and after an update, or measuring the change in loss under small perturbations of a weight.
What is the formula for the gradient using weight difference?
∂L(w)/∂w = (w(t-1) - w(t)) / α
What is the formula for measuring change in loss?
∂L(w)/∂w ≈ (L(w + ε) - L(w - ε)) / (2ε)
What is the definition of a partial derivative?
The partial derivative of L(w) with respect to w is defined as: \( \frac{\partial L(w)}{\partial w} = \lim_{\epsilon \to 0} \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon} \)
What indicates a bug in neural network training?
If the values from different methods of calculating partial derivatives are not similar, it indicates a bug.
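A minimal gradient check using the central-difference formula above; L(w) = w² is an illustrative loss whose analytic gradient 2w is compared against the numerical estimate:

```python
def numerical_gradient(L, w, eps=1e-5):
    # central difference: (L(w + eps) - L(w - eps)) / (2 * eps)
    return (L(w + eps) - L(w - eps)) / (2 * eps)

# Illustrative loss L(w) = w^2, whose analytic gradient is 2w
analytic = 2 * 3.0
numeric = numerical_gradient(lambda w: w ** 2, 3.0)
# a large gap between the two values would indicate a bug
```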
What is overfitting in neural networks?
Overfitting occurs when a model learns the training data too well, failing to generalize to unseen data.
How can overfitting be prevented?
To prevent overfitting, use held-out validation and test sets to measure generalization performance.
What is network capacity?
Network capacity refers to the number of parameters in a model and its ability to overfit the dataset.
What does it mean if a model is underfitting?
Underfitting means the model performs poorly on both training and validation sets due to insufficient capacity.
How can you improve a model that is underfitting?
Increase the number of neurons, parameters, or layers in the model to improve learning.
What indicates a model is overfitting?
Overfitting is indicated by good performance on the training set but poor performance on the validation set.
What is one method to prevent overfitting?
Limit the number of parameters in the model to prevent memorization of the dataset.
What is the best solution to overfitting?
The best solution to overfitting is to acquire more data for training.
What is early stopping in neural network training?
Early stopping is a method where training is halted when performance on the validation set does not improve for a set number of epochs.
What is regularization in the context of neural networks?
Regularization adds constraints to the model to prevent overfitting, such as penalizing large weights.
What are L2 and L1 regularization?
L2 regularization adds squared weights to the loss function, while L1 regularization adds absolute weights, both helping to control model complexity.
What does L2 regularization do to weights?
L2 regularization penalizes larger weights more, encouraging sharing between features and pushing weights towards 0.
What does L2 regularisation do?
Adds squared weights to the loss function, penalising larger weights more and encouraging sharing between features.
What is the formula for L2 regularisation loss function?
The formula is: J(θ) = Loss(y, ŷ) + λ ∑ w²
How does L2 regularisation affect weight updates?
The update rule is: w ← w − α(∂Loss/∂w + 2λw)
What is the role of the hyperparameter λ in L2 regularisation?
Controls the importance of regularisation, usually set to a low value (e.g., 0.001).
What does L1 regularisation do?
Adds the absolute value of weights to the loss function, using the sign of the weight for updates.
What is the formula for L1 regularisation loss function?
The formula is: J(θ) = Loss(y, ŷ) + λ ∑ |w|
How does L1 regularisation affect weight updates?
The update rule is: w ← w − α(∂Loss/∂w + λ sign(w))
How do L1 and L2 regularisation differ in weight management?
L2 pushes all weights towards 0, while L1 encourages sparsity, keeping many weights at 0.
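The two penalties and their update rules as a numpy sketch (function names are mine):

```python
import numpy as np

def l2_penalty(w, lam=0.001):
    return lam * np.sum(w ** 2)     # J = Loss + lambda * sum(w^2)

def l1_penalty(w, lam=0.001):
    return lam * np.sum(np.abs(w))  # J = Loss + lambda * sum(|w|)

def regularised_update(w, grad, alpha, lam, kind="l2"):
    if kind == "l2":
        return w - alpha * (grad + 2 * lam * w)   # pushes w towards 0
    return w - alpha * (grad + lam * np.sign(w))  # encourages sparsity
```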
What is dropout in neural networks?
A method to reduce overfitting by randomly setting some neural activations to 0 during training.
What percentage of neurons are typically dropped during training with dropout?
About 50% of neurons are typically dropped at each backward pass.
What happens during testing when using dropout?
All neurons are used, but inputs are scaled to match training expectations.
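A sketch of dropout as described on these cards: activations are randomly zeroed during training, and scaled at test time so magnitudes match training expectations (many frameworks instead scale during training, so-called inverted dropout):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p_drop=0.5):
    # training: randomly zero out activations
    mask = rng.random(a.shape) >= p_drop
    return a * mask

def dropout_test(a, p_drop=0.5):
    # testing: keep all neurons but scale activations so the
    # expected magnitude matches what the next layer saw in training
    return a * (1 - p_drop)

a = np.ones(1000)
dropped = dropout_train(a)   # roughly half the activations become 0
```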
What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data, while unsupervised learning uses only feature values without labels.
What is the objective of unsupervised learning?
To find hidden structures in the dataset without ground-truth labels.
What is unsupervised learning?
A type of learning where the dataset consists only of feature values without ground-truth labels.
What is the objective of unsupervised learning?
To find hidden structures in the dataset for making inferences or decisions.
What is clustering in unsupervised learning?
The task of finding groups ('clusters') of samples that might belong to the same class.
What is density estimation?
Finding the probability of seeing a point in a certain location compared to another location.
What is dimensionality reduction?
A process to reduce the number of features while retaining important information.
Name a famous algorithm for dimensionality reduction.
Principal Component Analysis (PCA).
What does clustering imply about intra-cluster variance?
There is low intra-cluster variance among instances in the same cluster.
What is the k-means algorithm used for?
To identify a specified number of clusters in a dataset.
What are the steps of the k-means algorithm?
Initialisation, Assignment, Update, and checking for convergence.
What is a cluster in clustering?
A set of instances that are similar to each other and dissimilar to instances in other clusters.
How does clustering help in vector quantization?
It improves encoding by clustering information in a datastream to reduce data size.
What is an example of using clustering in nature?
Identifying different species of flowers by plotting features like petal length vs. sepal width.
What is the structure of an unsupervised learning task?
A feature space with datapoints lacking additional information like labels or values.
What does k represent in k-means clustering?
The number of clusters, e.g., k = 3 means there are 3 centroids.
What is the first step in the k-means algorithm?
Initialisation: Select k random instances or generate random vectors for centroids.
What is the goal of the assignment step in k-means?
Assign every point in the dataset to the nearest centroid.
How do we update centroids in k-means?
By computing the average position of all points in each cluster.
What is checked during the convergence step in k-means?
The displacement of centroids; if it's larger than a threshold, loop back to assignment.
What are Voronoi diagrams?
Diagrams that create decision boundaries equidistant between centroids.
What is the formula for the assignment step in k-means?
∀i ∈ {1, …, N}: c(i) = argmin_{k ∈ {1, …, K}} ‖x(i) − μ_k‖²
What does the update formula in k-means compute?
The average location for all samples assigned to cluster k.
What condition indicates convergence in k-means?
If ∀k: ‖μ_k^t − μ_k^{t−1}‖ < ε.
What is checked in Step 4 of K-means?
Convergence by computing the movement of centroids between timesteps.
What indicates to stop iterating in K-means?
If the movement of centroids is lower than a certain threshold (𝜖).
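The four steps — initialise, assign, update, check convergence — as a compact numpy sketch; the empty-cluster guard and the toy two-blob data are my own additions:

```python
import numpy as np

def kmeans(X, k, max_iters=100, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 - Initialisation: pick k random data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2 - Assignment: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3 - Update: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid - a guard I added)
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        # Step 4 - Convergence: stop when centroids barely move
        moved = np.linalg.norm(new - centroids)
        centroids = new
        if moved < eps:
            break
    return centroids, assign

# Two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
centroids, assign = kmeans(X, k=2)
```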
How is K-means viewed as a model?
As a model optimization problem with centroid locations and data point assignments.
What is the objective of K-means?
Minimize the loss function L for assignments of data points to centroids.
What does the loss function L represent?
The mean distance between samples and their associated centroid.
What is the significance of K in K-means?
K is a crucial hyperparameter that affects the clustering results.
What is the Elbow Method used for?
To determine the optimal value of K by plotting loss values against K.
What should be selected according to the Elbow Method?
The value of K where the rate of decrease in loss sharply shifts.
What does cross-validation help determine?
The best value for hyperparameters using a validation set.
What are the strengths of K-means?
Simple, popular, and efficient with linear complexity.
What is a significant weakness of K-means?
The need to define K, which significantly impacts results.
What is a significant hyperparameter in K-means?
K (the number of clusters)
What is a weakness of K-means regarding its results?
It only finds a local optimum and is sensitive to initial centroid positions.
What technique can improve K-means initialization?
K-means++
When is K-means applicable?
When a distance function exists on the dataset, typically with real values.
What algorithm works with categorical data in clustering?
The k-modes algorithm.
How does the K-medoids algorithm differ from K-means?
It is less sensitive to outliers by using the geometric median instead of the mean.
What shape must clusters have for K-means to work effectively?
Clusters must be hyper-ellipsoids (or hyper-spheres).
What is the objective of density estimation algorithms?
To estimate the probability density function p(x) from data.
What does a Probability Density Function (PDF) model?
The likelihood of a continuous variable being observed within an interval.
What must the integral of a PDF over its range equal?
1
What is one application of density estimation?
Anomaly/novelty detection.
What is the goal of generative models in relation to probability?
To model the distribution of a class as p(X | y).
What do discriminative models directly model?
The probability of observing label y given sample values X, p(y | X).
What activation function transforms neural network output into a probability distribution?
Softmax activation.
What does the Softmax activation do?
Transforms the output of the neural network into a probability distribution.
What is Bayes’ rule used for in generative models?
To turn the generative model into a discriminative classifier.
What is the formula for Bayes’ rule?
\( p(y | X) = \frac{p(X | y)p(y)}{p(X)} \)
What do non-parametric methods assume about function shape?
They make no assumptions about the form/shape of the function.
What is an example of a non-parametric method?
k-NN algorithm.
What is the bias and variance characteristic of non-parametric methods?
Low bias; high variance depending on the data.
What do histograms do in density estimation?
Group data into bins, count occurrences, and normalize.
What does normalization ensure in histograms?
The integral of the function sums to 1, making it a valid PDF.
What is Kernel Density Estimation?
Estimates the density of a function by using a kernel around training examples.
What does the kernel function do in density estimation?
It weights each training example by its difference from the current point x, normalised by the bandwidth.
What is a Parzen window?
A method used in kernel density estimation to define the kernel.
What type of distribution can be used as a kernel in density estimation?
Gaussian distribution.
What are the characteristics of parametric approaches?
Make assumptions about the shape, inducing bias but fixing the number of parameters.
What is the univariate Gaussian distribution parameterized by?
Mean (μ) and variance (σ²).
What is ensured by the normalization factor in Gaussian distribution?
The integral of the distribution sums to 1.
What does the multivariate Gaussian distribution take as input?
A multi-dimensional vector.
What replaces variance in the Multivariate Gaussian Distribution?
The covariance matrix Σ.
What is the purpose of the normalization term in the Multivariate Gaussian Distribution?
To ensure the integral over all dimensions equals 1.
What does likelihood determine in a model?
How good the model is at capturing the probability of generating data x.
What assumption is made about the datapoints in the training set?
They follow i.i.d distributions.
What do we multiply to get the likelihood in a dataset?
The predicted values from the models for every sample with parameters θ.
Why do we calculate negative log-likelihood instead of likelihood?
To turn maximization into minimization, similar to training a neural network.
What does Gaussian fitting minimize?
The negative log likelihood.
What happens when you take the log of a multiplication term?
Multiplications turn into sums.
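The cards above (i.i.d. likelihood, negative log-likelihood, log turning products into sums) can be illustrated with a univariate Gaussian fit. A minimal sketch, assuming 1-D data; the closed-form maximum-likelihood estimates are the sample mean and sample standard deviation.

```python
import math

def gaussian_nll(data, mu, sigma):
    """Negative log likelihood of 1-D data under N(mu, sigma^2).
    Taking the log turns the product over i.i.d. samples into a sum."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

def fit_gaussian(data):
    """Closed-form maximum-likelihood estimates: sample mean and std."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    return mu, math.sqrt(var)

data = [1.0, 2.0, 3.0, 4.0]
mu, sigma = fit_gaussian(data)
```

Any other choice of mean gives a strictly larger negative log likelihood on the same data.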
Is the Gaussian distribution sufficient for modeling densities in all cases?
No, it may not be satisfactory for all data distributions.
What is the problem with fitting a Gaussian distribution to bimodal data?
It induces bias and may not capture the data's characteristics.
What is a potential solution to the limitations of Gaussian distributions?
Using mixture models to capture different modes of the distribution.
How is the PDF of mixture models defined?
As the weighted sum of multiple PDFs: \( p(x) = \sum_k \pi_k p_k(x) \).
What constraints does the mixing proportion 𝜋𝑘 follow?
\( 0 \le \pi_k \le 1 \) and \( \sum_k \pi_k = 1 \).
What does the Gaussian Mixture Model (GMM) estimate?
The probability density with p(x) from multiple Gaussian distributions.
What is the Gaussian Mixture Model a weighted sum of?
Gaussians, ensuring the PDF integrates to 1.
What is the purpose of GMMs?
GMMs can model complicated data, including multi-modal data.
What algorithm is used to fit GMM to training examples?
The Expectation Maximisation (EM) algorithm is used.
What are the two main steps of the EM algorithm?
The two main steps are the E-step (expectation) and the M-step (maximisation).
What is done in the E-step of the EM algorithm?
Responsibilities for each training example and each mixture component are computed.
How is the responsibility calculated in the E-step?
Using the formula: \( r_{ik} = \frac{\pi_k \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x^{(i)} | \mu_j, \Sigma_j)} \)
What is updated in the M-step of the EM algorithm?
GMM parameters are updated using the computed responsibilities.
How is the mean updated in the M-step?
The mean is updated using: \( \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} r_{ik} x^{(i)} \)
What is checked for convergence in the EM algorithm?
Convergence is checked by monitoring changes in parameters or log likelihood.
What is the Bayesian Information Criterion (BIC)?
BIC is used to select the number of components K in GMM.
What is the formula for BIC?
\( BIC_k = \mathcal{L}(K) + \frac{P_k}{2} \log(N) \)
What does \( \mathcal{L}(K) \) represent in the BIC formula?
\( \mathcal{L}(K) \) is the negative log likelihood.
What is the penalty term in the BIC formula?
The penalty term is \( \frac{P_k}{2} \log(N) \), which penalizes complex models.
What does N represent in the BIC formula?
N is the number of examples in the dataset.
What does Pk represent?
Pk is the number of parameters.
How many parameters does a 2D GMM with K components have?
Pk = 6K - 1.
What are the parameters for the mean in 2D Gaussian?
2 parameters for the mean (2D vector).
How many parameters are needed for covariance in 2D Gaussian?
3 parameters for the covariance (symmetric 2x2 matrix).
What is the purpose of the -1 in the parameter count?
It accounts for the constraint that the sum of mixing proportions must equal 1.
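Putting the BIC cards together: for a 2-D GMM, each component contributes 2 mean + 3 covariance + 1 mixing parameters, minus one for the sum-to-one constraint. A small sketch using the course's convention (L(K) is the negative log likelihood, so the best K minimises BIC):

```python
import math

def bic_2d_gmm(neg_log_likelihood, k, n):
    """BIC_K = L(K) + (P_K / 2) * log(N) for a 2-D GMM with K components.
    P_K = 6K - 1: per component 2 (mean) + 3 (symmetric 2x2 covariance)
    + 1 (mixing proportion), minus 1 for the sum-to-one constraint."""
    p_k = 6 * k - 1
    return neg_log_likelihood + (p_k / 2) * math.log(n)
```

With the same fit quality, a model with more components always pays a larger penalty.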
What principle is suggested for model selection?
Occam’s Razor: pick the simplest model that fits.
What happens to BIC values as K increases?
BIC values decrease sharply then rise again due to penalty dominance.
What is cross-validation used for?
To find the most appropriate number of components K.
What are the steps in cross-validation for GMM-EM?
What is a key similarity between GMM and K-means?
Both require selecting the most appropriate K value for clusters/components.
What does convergence mean in GMM and K-means?
Convergence occurs when changes in parameters are sufficiently small.
How does GMM initialization relate to K-means?
GMM means are often initialized from K-means centroid locations.
What is soft clustering in GMM?
Every point belongs to several clusters with varying degrees of membership.
What distance metric is used in GMM?
Distance is related to Mahalanobis distance, encoded by the covariance matrix.
What is the focus of Module 7?
Evolutionary algorithms, including genetic algorithms.
What is the purpose of genetic/evolutionary algorithms?
Optimization for black box functions.
What is a Genetic/Evolutionary Algorithm?
An optimisation method for black box functions without knowing the mathematical equation or gradient, inspired by natural evolution and genetics.
What is reinforcement learning?
Learning to maximize a numerical reward, considered an optimization problem.
What do traditional RL algorithms deal with?
Discrete states and action spaces.
What do policy search algorithms deal with?
Continuous search spaces, represented as \( x^* = \arg\max_x f(x) \).
What are Black-Box Optimisation Algorithms?
Algorithms where the links between parameters are unknown at the start of training.
What is an example of black-box optimisation in robotics?
The unknown relationship between speed and joint movements.
What year did Darwin publish his theory about the origin of species?
1859.
What are the four main concepts of Darwin's theory?
Who discovered principles of statistical inheritance?
Mendel in 1866.
What did Weismann discover in 1883?
Acquired traits are not passed to offspring.
What did Watson, Crick, and Franklin discover in 1953?
The structure of DNA.
What is a gene?
A sequence of nucleotides in DNA that codes a particular trait.
What is a genotype?
A set of genes (parameters).
What is a phenotype?
The physiological expression of the genotype.
What are the three main families of genetic algorithms proposed in the 60s?
What is Genetic Programming?
The evolution of programs, defining and computing a program as a tree.
What is the main concept of genetic/evolutionary algorithms?
They have a population of solutions encoding genotypes, which are developed into phenotypes for evaluation.
What happens to the worst-performing functions in genetic algorithms?
They are removed (killed); crossover and mutation are applied to the better-performing solutions to create offspring.
What is observed in the black box function?
The output helps to rank the phenotypes.
What is the result of repeating the evolutionary process?
The solution converges to an optimal high-performing solution.
What principle do these algorithms use as a base?
A simplified version of Neo-Darwinism.
How is each solution represented in evolutionary algorithms?
Each solution is represented by a genotype.
What function measures the performance of phenotypes?
A fitness function is used.
What is the selection operator?
It selects the solutions that will be reproduced.
What does the cross-over operator do?
It mixes the parents’ genotype to create the offspring.
What is the mutation operator?
It applies variations to the genotype after reproduction.
What is the genotype in a genetic algorithm?
A binary string of fixed size (e.g., 01001010).
What is the genotype in genetic programming?
A program represented as a tree (often in LISP).
What does the mutation in evolutionary strategies draw from?
It draws from a Gaussian distribution.
What term is used to describe the blurred lines between algorithm families?
Evolutionary Algorithms.
What is the goal of the Mastermind game?
Finding the secret combination of colors.
How many colors can each piece have in Mastermind?
Each piece can have 6 different colors.
What is the fitness function for Mastermind?
F(x) = p1 + 0.5*p2.
What does p1 represent in the Mastermind fitness function?
The number of pieces with the right color and correct position.
What does p2 represent in the Mastermind fitness function?
The number of pieces with the right color but wrong position.
What is the goal of evolutionary algorithms regarding fitness functions?
Maximize the fitness function.
What value of F(x) indicates the problem is solved?
F(x) = 4 (all four pieces have the right colour and position).
What is the fitness function for teaching a robot to walk?
F(x) = walking speed = travelled distance after a few seconds.
What is the fitness function for teaching a robot to throw an object?
F(x) = distance(object, target).
What do genotype and phenotype represent in problem-solving?
Potential solutions to the problem.
What is the genotype for the Mastermind game?
Binary string with N*3 bits.
How is the phenotype created from the genotype in the Mastermind game?
Aggregate bits 3 by 3, each trio becomes an integer.
What do integers correspond to in the Mastermind game?
Different colours: (0=red, 1=yellow, 2=green, 3=blue…).
What is done with invalid genotypes in the Mastermind game?
Assigned the lowest fitness value to reduce survival chance.
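The genotype-to-phenotype decoding described above can be sketched directly. A minimal illustration; the colour names beyond the four listed in the notes are assumptions, and invalid genotypes (a trio decoding to 6 or 7) are flagged so the caller can assign them the lowest fitness.

```python
# 6 possible colours; names after "blue" are hypothetical placeholders
COLOURS = ['red', 'yellow', 'green', 'blue', 'orange', 'purple']

def decode_mastermind(genotype, n_pieces):
    """Decode an N*3-bit binary string: each group of 3 bits is one
    colour index. Returns None for invalid genotypes (index >= 6),
    which should then receive the lowest fitness value."""
    assert len(genotype) == n_pieces * 3
    phenotype = []
    for i in range(0, len(genotype), 3):
        idx = int(genotype[i:i + 3], 2)   # trio of bits -> integer 0..7
        if idx >= len(COLOURS):
            return None                   # 6 or 7: no matching colour
        phenotype.append(COLOURS[idx])
    return phenotype
```

For example, `'000001010011'` with 4 pieces decodes to red, yellow, green, blue.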
What is the purpose of selection operators in evolutionary algorithms?
Select parents for the next generation.
What is a standard approach for selection in evolutionary algorithms?
Biased roulette wheel.
How does the biased roulette wheel process work?
Individuals are selected based on their fitness proportion.
What is the first step in the biased roulette wheel process?
Compute the probability pi to select an individual.
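The biased roulette wheel can be sketched as follows: compute each individual's selection probability as its share of the total fitness, then sample accordingly. A minimal illustration assuming non-negative fitness values.

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Biased roulette wheel: individual i is selected with
    probability p_i = f_i / sum_j f_j (assumes non-negative fitness)."""
    total = sum(fitnesses)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for individual, f in zip(population, fitnesses):
        acc += f
        if r <= acc:
            return individual
    return population[-1]  # guard against floating-point round-off
```

Over many draws, fitter individuals are selected proportionally more often.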
What is the alternative to the roulette wheel selection method?
Tournament selection.
What is elitism in evolutionary algorithms?
Keeping a fraction of the best individuals in the new generation.
What fraction is usually fixed for elitism?
10%.
What is the role of the crossover operator?
Combine traits of the parents.
What is a common method for crossover?
Single-point crossover.
What is the role of the mutation operator?
Explore nearby solutions in the local solution space.
How is standard mutation on binary strings performed?
Randomly generate a number for each bit; if lower than probability m, mutate.
What is the first step in standard mutation on binary strings?
Randomly generate a number between 0 and 1 for each bit of the genotype.
What happens if the generated number is lower than probability m?
The bit is flipped.
What is m typically set to in standard mutation?
1/(size of the genotype).
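The two variation operators described above — single-point crossover and per-bit mutation with probability m = 1/(genotype length) — can be sketched on binary strings. An illustrative sketch, not code from the course.

```python
import random

def single_point_crossover(parent_a, parent_b, rng=random):
    """Cut both parent genotypes at one random point and swap the tails."""
    point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def bit_flip_mutation(genotype, m=None, rng=random):
    """Flip each bit independently with probability m
    (default m = 1 / genotype length, as in the notes)."""
    if m is None:
        m = 1.0 / len(genotype)
    return ''.join(
        ('1' if bit == '0' else '0') if rng.random() < m else bit
        for bit in genotype
    )
```

Crossover mixes the parents' traits; mutation explores nearby solutions by flipping the occasional bit.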
What is the purpose of the specific mutation in the Mastermind problem?
To swap groups of 3 bits in the genotype with probability m2.
What is a common stopping criterion for evolutionary algorithms?
When a specific fitness value is reached.
What fitness value indicates an optimal solution in the example?
A fitness value of 4.
What is another stopping criterion besides reaching a fitness value?
After a pre-defined number of generations/evaluations.
What is the first step in the evolutionary algorithm flowchart?
Randomly generate the population.
What do we do after evaluating the population in the evolutionary loop?
Select individuals to keep for the next generation.
What is elitism in the context of evolutionary algorithms?
Keeping a few parents in the new population.
What is the function used to evaluate fitness in Mastermind?
F(x) = p1 + 0.5 p2.
What are evolutionary strategies designed to optimize?
Real values in problems.
What is the main difference between genetic algorithms and evolutionary strategies?
Genotype: genetic algorithms use binary strings, evolutionary strategies use real values.
What does the μ + λ evolutionary strategy represent?
Maintains a steady population of μ + λ individuals.
What is the first step in the evolutionary strategy process?
Randomly generate a population of (μ + λ) individuals.
What do you do after generating the population?
Evaluate the population.
How many best individuals are selected as parents?
Select the μ best individuals from the population as parents (called x).
What is generated from the parents in the evolutionary strategy?
Generate λ offsprings (called y) from the parents.
What is the formula for generating offspring?
For each offspring, use the formula: \( y_i = x_j + \mathcal{N}(0, \sigma) \), where j is a random parent index in μ.
How is the population defined in the evolutionary strategy?
Population = union of parents and offspring: \( \text{population} = (\bigcup_{i=1}^{\lambda} y_i) \cup (\bigcup_{j=1}^{\mu} x_j) \).
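The (μ + λ) loop above can be sketched end to end: keep the μ best of the combined population, generate λ offspring by adding Gaussian noise to random parents, and repeat. An illustrative toy (the hyperparameter values and the example objective are assumptions).

```python
import random

def mu_plus_lambda_es(fitness, dim, mu=5, lam=10, sigma=0.3,
                      generations=200, seed=0):
    """(mu + lambda) evolution strategy sketch for maximising `fitness`.
    Offspring: y_i = x_j + N(0, sigma) noise on each dimension."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(dim)]
                  for _ in range(mu + lam)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)   # best first
        parents = population[:mu]                    # select mu parents
        offspring = [
            [x + rng.gauss(0.0, sigma) for x in rng.choice(parents)]
            for _ in range(lam)
        ]
        population = parents + offspring             # union: (mu + lambda)
    return max(population, key=fitness)

# e.g. maximise f(x) = -sum(x_i^2), whose optimum is the origin
best = mu_plus_lambda_es(lambda x: -sum(v * v for v in x), dim=2)
```

With a fixed σ the same trade-off discussed below applies: too large and the population struggles to refine, too small and it moves slowly.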
What is the main challenge in evolutionary strategies?
The main challenge comes in fixing the hyperparameter 𝜎.
What happens if 𝜎 is too large?
If 𝜎 is too large, the population moves quickly to the solution but struggles to refine it.
What happens if 𝜎 is too small?
If 𝜎 is too small, the population moves slowly and might be affected by local optima.
How can 𝜎 be adjusted over time?
Change 𝜎's value over time to adapt to the situation by adding sigma into the genotype.
What is the new genotype defined as?
Define another genotype as xj’ = {xj, σj} composed of the initial genotype and sigma value.
How is the new offspring's sigma calculated?
Calculate \( \sigma_i = \sigma_j \exp(\tau_0 \mathcal{N}(0, 1)) \).
What does the learning rate depend on?
The learning rate 𝜏0 is proportional to 1/√𝑛, where n is the number of dimensions of the genotype.
Why is substituting 𝜎 with 𝜏0 beneficial?
The selection of 𝜏0 is less critical than the value of 𝜎, allowing more flexibility in setting it.
What is a variant of evolutionary strategies?
CMA-ES algorithm, which evolves a covariance matrix.
What is an approach to genetic algorithms?
Discretise the parameters and use binary strings.
What is the goal of taking inspiration from natural evolution?
To find effective solutions for survival and adaptation in environments.
What is the purpose of novelty search?
To use novelty instead of fitness value to drive the search for optimality.
What does the novelty search algorithm focus on instead of fitness?
Novelty value
What is the purpose of the novelty archive?
To store all encountered solutions for novelty calculation
How is novelty calculated?
By summing distances to the k nearest neighbours in the archive (e.g., k = 3).
What does a larger novelty indicate?
More difference from previous solutions
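The novelty score just described can be sketched directly: sum the distances from a candidate's behavioural descriptor to its k nearest neighbours in the archive. A minimal illustration with Euclidean distance (the distance metric is an assumption; in practice it depends on the behavioural descriptor).

```python
def novelty(candidate, archive, k=3):
    """Novelty score: sum of distances to the k nearest neighbours in
    the archive of previously encountered solutions. A larger score
    means the candidate differs more from what was seen before."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    distances = sorted(dist(candidate, other) for other in archive)
    return sum(distances[:k])
```

A candidate far from everything in the archive scores much higher than one sitting inside a dense cluster of past solutions.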
What does the behavioral descriptor characterize?
Aspects of solutions and distances between them
Why is the behavioral descriptor task-specific?
It defines features to compare based on the task
What can happen if a feature is ignored in the behavioral descriptor?
Loss of potentially useful information
What is an example of a behavioral descriptor for a robot?
(x, y) coordinates of the robot's final position
What problem can a fitness-focused algorithm encounter?
Getting stuck in local minima
How does novelty search differ from traditional evolutionary algorithms?
It uses novelty score instead of fitness for evaluation
What is the goal of Quality-Diversity Optimization?
To learn diverse and high-performing solutions in one process
What does the concept of Quality-Diversity Optimization apply to?
Real-valued search space
What is a potential benefit of novelty search for a bipedal robot?
Leads to a more stable and successful robot
What is the goal of high-dimensional hyperspace exploration?
To find points that lead to the most interesting solutions.
What does the concept of behavioural descriptors help generate?
A collection of high-performing solutions with high diversity and performance.
How many degrees of freedom does the robot in the example have?
12 degrees of freedom (2 in each leg).
How many real-valued dimensions are there for the robot's movement?
36 real-valued dimensions.
What is the behavioural descriptor for the robot's movement?
Proportion of time each leg touches the ground (6 dimensions).
What is the goal of varying the proportions of time each leg spends touching the ground?
To find an optimal solution for walking as fast as possible.
How many ways to walk were found using the MAP-Elites algorithm?
Over 13,000 ways to walk.
What are the two main focuses of Quality-Diversity (QD) algorithms?
Measuring performance of solutions and distinguishing different types of solutions.
What is a fitness function used for in QD algorithms?
To measure the performance of solutions.
What does the behavioural descriptor characterize in QD algorithms?
It distinguishes different types of solutions.
What does Novelty Search with Local Competition optimize?
Two fitness functions: novelty score and local competition.
What is the concept of Local Competition in QD algorithms?
Comparing new solutions only with similar ones in the same categories.
What does LC(x) represent in Local Competition?
Number of solutions that x outperforms within its k nearest neighbours.
What happens when a better version of a solution is found in the archive?
The worse version is replaced by the better one.
What is the goal of MAP-Elites?
To discretise the behavioural descriptor space in a grid and fill it with the best solutions.
What does MAP-Elites stand for?
Multi-Dimensional Archive of Phenotypic Elites.
What is the main advantage of MAP-Elites?
Easy to implement and performs well in general.
What is a disadvantage of MAP-Elites?
Density of the solution is not always uniform.
How does MAP-Elites add new solutions?
If the cell is empty, the new solution is added; if occupied, the best fitness solution is kept.
What is the hyper-parameter in MAP-Elites?
Size of the cells (resolution of the grid).
What is the first step in the MAP-Elites process?
Randomly initialise some solutions to place in the grid.
What happens during the mutation operator in MAP-Elites?
Gaussian noise is added to some/all values of the selected solution.
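The MAP-Elites loop described in the cards above — random initialisation, uniform selection from the grid, Gaussian mutation, and keep-the-best-per-cell addition — can be sketched compactly. A toy illustration; the `evaluate` interface, cell size, and iteration counts are assumptions.

```python
import random

def map_elites(evaluate, dim, cell_size=0.5, iters=2000, seed=0):
    """MAP-Elites sketch. `evaluate` maps a genotype to
    (fitness, behavioural_descriptor). The grid keeps, per cell,
    the highest-fitness solution whose descriptor lands in that cell."""
    rng = random.Random(seed)
    archive = {}  # cell key -> (fitness, genotype)

    def add(genotype):
        fitness, bd = evaluate(genotype)
        cell = tuple(int(b // cell_size) for b in bd)  # discretise descriptor
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, genotype)        # keep best per cell

    for _ in range(100):                               # random initialisation
        add([rng.uniform(-1, 1) for _ in range(dim)])
    for _ in range(iters):                             # evolutionary loop
        _, parent = rng.choice(list(archive.values())) # uniform selection
        add([x + rng.gauss(0.0, 0.2) for x in parent]) # Gaussian mutation
    return archive
```

The archive size is the diversity metric, and the best fitness in the archive is one quality metric.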
What is a common metric for diversity in MAP-Elites?
Archive size (number of solutions stored in the collection).
What does the QD-score represent?
The sum of the fitness of all solutions in the archive.
What is the trade-off in QD algorithms represented by?
A Pareto-front to define the best variant of the algorithm.
What is the usual metric for performance in MAP-Elites?
Max or mean fitness value of all solutions.
What does the coverage refer to in MAP-Elites?
Number of filled cells, number of individuals, or % of filled cells in the grid.
What is the purpose of local competition in the algorithm?
To explore many different solutions in the entire space.
What is the addition mechanism in MAP-Elites?
It determines how new solutions are added to the grid based on their fitness.
What is a general framework in QD algorithms?
Allows use of different operators to define quality diversity algorithms for specific tasks.
What does the selector do in QD algorithms?
Selects the individual to be mutated and evaluated in the next generation.
What is the simplest selection method used in MAP-Elites?
Uniform random selection over the solutions in the container.
What are the criteria for proportional selection in QD?
Fitness, novelty, curiosity score.
How can solutions be stored in QD?
Discretised grid (like MAP-Elites) or unstructured archive (like Novelty Search).
What is a key feature of the unstructured archive in QD?
Maintains density instead of strict discretisation.
What is the process for using advanced mutations in QD?
Select multiple operators in stochastic selection, then apply cross-over before mutation.
What is the QD algorithm for teaching a robot to walk?
Unstructured archive + random uniform selector.
What is the behavioral descriptor for the walking robot?
X/Y coordinate position of the robot after 3 seconds.
What is the fitness measure for the walking robot?
Angular error at the end of the trajectory w.r.t. an ideal circular trajectory.
What is the QD algorithm for teaching a robot to push a cube?
MAP-Elites (grid + random uniform selector).
What is the behavioral descriptor for the cube-pushing robot?
Final position of the cube, where diversity is desired.
What is the fitness measure for the cube-pushing robot?
Energy efficiency of the movement.
What are genetic algorithms, evolutionary strategies, and evolutionary algorithms based on?
The same basic concepts.
What is clustering in unsupervised learning?
Dividing data into groups based on similarities, like dogs and cats.
What is dimensionality reduction?
Identifying important features in data, like enhancing a blurry image of a face.
What is reinforcement learning?
An algorithm interacts with the environment to produce a reward signal for improvement.
What is policy search in reinforcement learning?
Finding actions for an agent to maximize received rewards based on its state.
What is semi-supervised learning?
Some data have labels, some do not; aims to label unlabelled data using labelled items.
What is weakly-supervised learning?
Inexact output labels; e.g., indicating an item is somewhere in an image without precise location.
What is classification in machine learning?
Assigning discrete or categorical variables to inputs, like predicting actions in videos.
What is multi-label classification?
A classification task where multiple labels can be correct for a single input.
What is simple regression?
1 input variable and 1 output variable. E.g., size of a house predicts its price.
What is multiple regression?
Multiple input variables and 1 output variable. E.g., grade calculator with 3 inputs and 1 output (grade).
What is multivariate regression?
Multiple inputs to predict multiple outputs. E.g., predicting the location of an umbrella from a picture.
What is an example regression problem?
Given time as input, the regressor predicts the value at that time.
What characterizes a good predictor in regression?
The line is close to most points, even if it is off.
What characterizes a very good predictor in regression?
It predicts given points well but may struggle with unknown examples.
What is supervised learning?
Most common setting in ML problems, typically involves classification and regression.
How does Antoine classify shapes?
By placing data along 2 axes (colour and points) to create a classifier.
What is a linear classifier?
A classifier that uses a straight line to separate data into categories.
Why is selecting good features important?
Good features improve prediction accuracy; combining features is often better.
What are two ways to make predictions?
What is the goal of generating a model in supervised learning?
To approximate the true function using input data to predict outputs.
What is the training dataset defined as?
A sequence of pairs of input and output labels (Xn and yn).
What is feature encoding in supervised learning?
Transforming raw input observations into a modified version (feature space).
What is the purpose of the Xtest dataset?
To evaluate model performance on unseen data by comparing predicted outputs with ground truth.
What do we compute to measure model performance?
A score comparing predicted outputs with the ground truth/gold standard annotation.
What is the purpose of the truth/gold standard annotation?
To compute a score measuring model performance.
Why is it important to examine data before designing an algorithm?
It can provide clues for classifier design and help identify class label distribution.
What happens if class labels are imbalanced?
The algorithm may learn to identify only the majority class.
What is the curse of dimensionality?
As dimensions increase, data becomes sparse and training data may be noisy.
What is the Bag of Words method in NLP?
Logging the frequency of words without tracking their positions.
What is the modern approach to feature encoding in deep learning?
Letting the algorithm figure out optimal features from raw data.
What is a lazy learner?
Stores training examples and generalizes upon explicit request at test time.
What is an eager learner (the opposite of a lazy learner)?
Learns and generalises all it can before test time, resulting in quicker test time.
How does a nearest neighbour classifier work?
Looks at the nearest neighbour and classifies itself as the same.
What is a Linear Model?
Assumes the data is linearly separable, learning the best line to separate it.
What does a Linear Model classify?
Anything on the left as a green diamond, anything on the right as a red circle.
What is Feature Space Transformation?
Representing data differently to analyze and separate it more easily.
How do Neural Networks handle non-linear datasets?
Try to learn how to transform the feature space automatically.
What is the Bias-Variance trade-off?
A balance between overfitting (high variance) and underfitting (high bias).
What does MSE stand for?
Mean Squared Error, measures average square distance between correct and predicted outputs.
What is the Baseline in performance evaluation?
The lower bound for performance, often chance/random performance.
What is the Upper bound in performance evaluation?
The best case, often compared to human performance.
What are Decision Trees in ML?
Eager learners that process all data upfront and discard it after analysis.
What does the Nearest Neighbour Classifier do?
Classifies a test instance to the class label of the nearest training instance.
What does increasing k do to the classifier?
Makes the decision boundary smoother and less sensitive to training data
What is the curse of dimensionality in k-NN?
Distance metrics may not work well in high dimensional spaces
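The nearest-neighbour cards above can be illustrated with a tiny k-NN classifier: find the k closest training points and take a majority vote. A minimal sketch using Euclidean distance.

```python
from collections import Counter

def knn_predict(query, train_x, train_y, k=3):
    """k-NN classifier: majority vote among the k nearest training
    points (Euclidean distance). Larger k smooths the boundary."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    neighbours = sorted(zip(train_x, train_y),
                        key=lambda xy: dist(query, xy[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]
```

With k = 1 this reduces to the plain nearest-neighbour classifier.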
What is decision tree learning?
A method for approximating discrete classification functions using a tree-based representation.
What type of search do decision tree learning algorithms use?
Top-down greedy search through the space of possible solutions.
What is the first step in the general decision tree algorithm?
Search for the optimal splitting rule on training data.
What is the goal of finding an optimal split rule?
To create partitioned datasets that are more 'pure' than the original dataset.
What does Gini Impurity measure?
The probability of incorrectly classifying a randomly picked point according to class label distribution.
What is Variance Reduction mainly used for?
Regression trees where the target variable is continuous.
What is the formula for the amount of information required to determine the state of a random variable?
I(x) = log2(K).
What happens to information required when the impostor is more likely in one box?
Low entropy; less new information is gained.
What is the information required when the impostor is equally likely in 4 boxes?
I(x) = -log2(1/4) = 2 bits.
What does low entropy indicate?
You don’t need to know a lot of information to predict the value of a random variable.
What does high entropy indicate?
A lot of new information is gained when predicting the value of a random variable.
What is information gain?
Difference between initial entropy and weighted average entropy of subsets.
What is the binary tree information gain formula?
\( IG(\text{dataset}, \text{subsets}) = H(\text{dataset}) - \left( \frac{|S_{left}|}{|\text{dataset}|} H(S_{left}) + \frac{|S_{right}|}{|\text{dataset}|} H(S_{right}) \right) \)
What are categorical values in decision trees?
Search for the most informative feature, create branches for each value.
What is the formula for information gain for 'outlook'?
IG(D, outlook) = H(D) - (5/14 H(Dsunny) + 4/14 H(Dovercast) + 5/14 H(Drain))
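Entropy and information gain as used in the splits above can be computed in a few lines. A minimal sketch (label lists rather than full datasets) that works for any number of subsets, binary or categorical.

```python
import math

def entropy(labels):
    """Shannon entropy H = -sum_c p_c * log2(p_c) of a list of labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(labels, subsets):
    """IG = H(dataset) - weighted average entropy of the subsets,
    weighted by each subset's share of the dataset."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)
```

A perfectly balanced binary dataset has entropy 1 bit; a split into two pure subsets therefore has information gain 1.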
What is the first step in pruning a decision tree?
Go through each internal node connected only to leaf nodes.
What does a random forest consist of?
A collection of decision trees trained on different subsets of data.
What is the outcome of the algorithm in a random forest?
The majority vote by all the different trees.
How do you make predictions with regression trees?
By taking an average or weighted average of samples in the leaves.
What is the purpose of taking an average in machine learning predictions?
To make predictions based on the distance of different samples in the leaves of the tree.
What is the ultimate goal when creating machine learning systems?
To develop models that generalise to previously unseen examples.
Why is shuffling important before splitting a dataset?
To avoid implicit ordering in the dataset that can bias results.
What are hyperparameters in machine learning?
Model parameters chosen before training, such as 'k' in k-NN.
What is the motivation behind hyperparameter tuning?
To choose hyperparameter values that give the best performance.
What is a disadvantage of testing hyperparameters on the training dataset?
It usually does not generalise well to unseen examples.
What should never be done when evaluating hyperparameters?
Using the test dataset to select hyperparameters based on accuracy.
What is the correct approach for dataset splitting in machine learning?
Split into training, validation, and test sets, e.g., 60:20:20.
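The split just described — shuffle first, then carve out train/validation/test portions — can be sketched directly. A minimal illustration with the 60:20:20 ratio from the card.

```python
import random

def train_val_test_split(data, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle, then split into train/validation/test (e.g. 60:20:20).
    Shuffling first avoids bias from any implicit ordering in the data."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])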
What is hyperparameter tuning/optimisation?
Selecting parameters that produce the best classifier performance.
What can be done for final evaluation after hyperparameter tuning?
Optionally include the validation set back into the training set.
What can be included in the training set for final evaluation?
Validation set can be included to retrain the model on the whole dataset after finding best hyperparameters.
What is the purpose of including the validation set in training?
It provides more data for training, potentially increasing model performance.
What is a risk of developing and evaluating a model on the same data?
It results in overfitting the model to the training data.
What should the test set be used for?
The test set should only be used for estimating performance on unknown examples.
What is cross-validation used for?
Cross-validation is used when the dataset is small to ensure effective testing.
What are the steps in cross-validation?
Split the data into k equal folds; for each fold, train on the remaining k-1 folds and test on the held-out fold; average the k error estimates.
What does the global error estimate formula represent?
It averages performance metrics across all k held-out test sets.
What is important about cross-validation in model evaluation?
It evaluates an algorithm rather than a single trained instance of a model.
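The k-fold procedure and the averaged ("global") error estimate can be sketched as follows (the callback-style `train_and_error` interface is an assumption for illustration):

```python
def kfold_error(data, k, train_and_error):
    """Average the error measured on each of the k held-out folds."""
    fold_size = len(data) // k
    errors = []
    for i in range(k):
        held_out = data[i * fold_size:(i + 1) * fold_size]
        train_folds = data[:i * fold_size] + data[(i + 1) * fold_size:]
        errors.append(train_and_error(train_folds, held_out))
    return sum(errors) / k

# Dummy error function returning a fixed value, just to show the call shape:
estimate = kfold_error(list(range(10)), k=5, train_and_error=lambda tr, te: 0.2)
```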
What is one option for parameter tuning during cross-validation?
Use 1 fold for testing, 1 for validation, and k-2 for training in each iteration.
What is an alternative method for parameter tuning in cross-validation?
Cross-validation within cross-validation, separating 1 fold for testing.
How does the second option for parameter tuning help?
It allows for optimal hyperparameters to be found using more data.
What is the advantage of using different hyperparameters on each fold during cross-validation?
It likely leads to the best results for small data sets.
What is a disadvantage of using different hyperparameters on each fold?
It requires more work and experiments than simpler methods and is not practical in all situations due to high computation needs.
What is the advantage of testing on all data when going into production?
You can use all available data to train the model for better performance.
What is a disadvantage of testing on all data?
You cannot estimate the performance of the final trained model anymore; you rely on hyperparameters generalizing.
What are the steps in CASE 1 for plenty of data available?
Split into training, validation, and test sets; train models with different hyperparameters on the training set, select the best on the validation set, and report final performance on the test set.
What are the steps in CASE 2 for limited data available?
Use cross-validation: split the data into k folds, tune hyperparameters within the training portion of each iteration, and average performance over the held-out folds.
What does a confusion matrix represent?
It visualizes performance, showing true labels vs. predicted labels, allowing analysis of model performance.
What is precision in model evaluation?
Precision = TP / (TP + FP). It measures the correctness of positive predictions.
What does high precision indicate?
If a model predicts something as positive, it is likely to be correct.
What is recall in model evaluation?
Recall = TP / (TP + FN). It measures the ability to find all positive examples.
What does high recall indicate?
Good at retrieving positive examples, but may include false positives.
What is the trade-off between precision and recall?
High precision often leads to low recall and vice versa.
What is the formula for F1 score?
\( F1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \)
What is accuracy in classification?
Accuracy = \( \frac{Number \ of \ correctly \ classified \ examples}{Total \ number \ of \ examples} \)
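The four metrics above can be computed directly from the binary confusion-matrix counts (the example counts are made up for illustration):

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

p, r, f1, acc = metrics(tp=8, fp=2, fn=4, tn=6)
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))  # 0.8 0.67 0.73 0.7
```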
What is the difference between micro-averaging and macro-averaging?
Macro-averaging averages metrics at the class level; micro-averaging at the item level.
What is the effect of micro-averaging on precision, recall, and F1 in binary and multi-class classification?
They equal accuracy.
What are the five important model characteristics in ML?
Accurate, Fast, Scalable, Simple, Interpretable
What does a normalized confusion matrix achieve?
Calculates metrics as if evaluated on a balanced dataset
What is one view of system performance on a balanced test set?
The classifier's performance remains the same.
What should be evaluated for a more realistic scenario?
The system should be evaluated with data having a realistic distribution.
What does overfitting indicate about model performance?
Good performance on training data, but poor generalization to other data.
What does underfitting indicate about model performance?
Poor performance on both training and test data.
What happens to classification error as models learn?
Classification error decreases for training but may increase for test data.
What can cause overfitting?
A model that is too complex or training data that is not representative.
What is the impact of a small test set on accuracy?
An accuracy of 90% measured on only 10 samples is far less reliable than the same accuracy measured on a large test set.
What affects confidence in evaluation results?
The size of the test set affects confidence in evaluation results.
What is true error?
True error is the probability that the model misclassifies a randomly drawn example from a distribution.
What is sample error?
Sample error is the classification error based on a sample from the underlying distribution.
How is sample error mathematically defined?
Sample error is defined as: \( error_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x), h(x)) \), where \( \delta \) is 1 when \( f(x) \neq h(x) \) and 0 otherwise.
What is a confidence interval?
An N% confidence interval is an interval that is expected with probability N% to contain the parameter q.
What does a 95% confidence interval [0.2, 0.4] mean?
It means that with probability 95%, the true parameter q lies between 0.2 and 0.4.
How does sample size affect confidence intervals?
As sample size n increases, the margin of error shrinks towards 0, leading to narrower intervals.
What is the example confidence interval for errorS(h) = 0.22 with n = 50?
With n = 50 and z_N = 1.96, the interval for errorD(h) is approximately 0.22 ± 0.11, i.e. roughly [0.11, 0.33], which is quite wide.
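The interval from this example follows from the usual binomial approximation \( error_S(h) \pm z_N \sqrt{error_S(h)(1 - error_S(h))/n} \); a small sketch:

```python
import math

def error_confidence_interval(error_s, n, z=1.96):
    """Approximate confidence interval for the true error from the sample error."""
    margin = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - margin, error_s + margin

low, high = error_confidence_interval(0.22, 50)
print(round(low, 2), round(high, 2))  # 0.11 0.33

# A larger sample shrinks the interval:
low2, high2 = error_confidence_interval(0.22, 5000)
```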
What does statistical significance testing help determine?
Statistical significance testing helps determine if there is a difference between two distributions of classification errors.
What does a graph with overlapping distributions indicate?
Overlapping distributions indicate uncertainty about which classifier is better due to sampling error.
What is the Marek ApprovedTM test?
The Marek ApprovedTM test is the Randomisation test, considered intuitive for comparing algorithms.
What do statistical tests determine?
Statistical tests tell us if the means of two sets are significantly different.
How does the Randomisation test work?
It randomly switches predictions between two systems and measures if the performance difference is greater or equal to the original difference.
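A sketch of this paired randomisation test on per-example correctness scores (the two result lists are hypothetical; 1 means the system got that example right):

```python
import random

def randomisation_test(correct_a, correct_b, trials=10000, seed=0):
    """Approximate p-value: how often does randomly swapping the two systems'
    per-example results give a difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    count = 0
    for _ in range(trials):
        a_total, b_total = 0, 0
        for x, y in zip(correct_a, correct_b):
            if rng.random() < 0.5:
                x, y = y, x  # swap the two systems' results for this example
            a_total += x
            b_total += y
        if abs(a_total - b_total) >= observed:
            count += 1
    return count / trials

a = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
p = randomisation_test(a, b)
```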
What does a small p-value indicate?
A small p-value means we can be more confident that one system is different from the other.
What is the null hypothesis?
The null hypothesis states that the two algorithms/models perform the same and differences are due to sampling error.
What is the significance level for performance difference?
Performance difference is statistically significant if p < 0.05 (5%).
What is P-hacking?
P-hacking is the misuse of data analysis to find patterns that appear statistically significant without an underlying effect.
What happens if the number of experiments increases in P-hacking?
Increasing experiments can lead to a higher false discovery proportion, even if true discoveries remain the same.
What is the false positive rate in the example of P-hacking?
P(false positive) = 0.05, the same as the significance level.
What is the false discovery proportion in the initial example?
The false discovery proportion is 35 / 115 = 30%.
What happens to the false discovery proportion when experiments increase to 2400?
The false discovery proportion increases to 115 / 195 = 59% (2300 false hypotheses × 5% = 115 false positives, against the same 80 true discoveries).
What is a method to defend against unintentional p-hacking?
Adaptive threshold for calculating p-value (Benjamini & Hochberg, 1995)
What does the Benjamini-Hochberg critical value formula represent?
New significance threshold (critical value)
What is the downside of the Benjamini-Hochberg method?
Thresholds for most experiments will be lower than the original 5%
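A sketch of the Benjamini-Hochberg procedure, assuming the standard critical value \( \frac{i}{m} \cdot q \) for the i-th smallest of m p-values (the example p-values are made up):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return which hypotheses count as discoveries under the BH procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value is below its critical value
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank
    significant = [False] * m
    for idx in order[:k]:  # all hypotheses up to rank k are accepted
        significant[idx] = True
    return significant

print(benjamini_hochberg([0.001, 0.02, 0.04, 0.3]))  # [True, True, False, False]
```

Note how 0.04 is rejected even though it is below the original 0.05: its critical value is 3/4 × 0.05 = 0.0375, illustrating the lowered thresholds mentioned above.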
What are Artificial Neural Networks (ANNs)?
A class of ML algorithms optimized with gradient descent
Why has deep learning become more popular now?
Better conditions for implementation, like big data and faster hardware
What is backpropagation?
Described in 1974 by Werbos, it is a training algorithm for neural networks
What are LSTMs and CNNs?
Key components of modern neural network architectures described in the late '90s
What is a benefit of having large datasets for neural networks?
They improve training efficiency and effectiveness
What advancements have improved neural network training?
Better CPUs and GPUs for efficient computation
What has improved the accessibility and affordability of graphics cards?
Increased efficiency and reduced cost
What are automatic differentiation libraries used for?
They handle backpropagation and optimisation of model parameters.
What is linear regression useful for in machine learning?
It serves as a stepping stone towards neural network models
What do derivatives show in the context of linear regression?
How to change each parameter value to reduce loss
What is the learning rate in gradient descent?
The learning rate, denoted as 𝛼, is a hyperparameter that determines the size of the steps taken towards the minimum of the loss function.
What does 𝜕𝐸/𝜕𝑎 represent?
It represents the partial derivative of the loss function with respect to parameter 𝑎.
What is the formula for updating parameter 𝑎?
The update rule is: \( a_{new} := a_{old} - \alpha \frac{1}{N} \sum_{i=1}^{N} (a x^{(i)} + b - y^{(i)}) x^{(i)} \), where N is the total number of data points.
What does an epoch represent in machine learning?
An epoch is one complete pass over the entire dataset during training.
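The update rule and the notion of an epoch can be sketched for the 1-D linear model \( y \approx ax + b \) (function name, learning rate and toy data are illustrative; the gradients are those of the halved mean squared error):

```python
def fit_line(xs, ys, alpha=0.05, epochs=2000):
    """Fit y ~ a*x + b with batch gradient descent; one epoch = one full pass."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Partial derivatives of E = (1/2N) * sum((a*x + b - y)^2) w.r.t. a and b
        grad_a = sum((a * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((a * x + b - y) for x, y in zip(xs, ys)) / n
        a -= alpha * grad_a
        b -= alpha * grad_b
    return a, b

a, b = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])  # data generated by y = 2x + 1
```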
What is the gradient in vector notation?
The gradient is the vector of all partial derivatives for a function with K parameters: \( \nabla_{\theta} f(\theta) = [\frac{\partial f(\theta)}{\partial \theta_1}, \frac{\partial f(\theta)}{\partial \theta_2}, ..., \frac{\partial f(\theta)}{\partial \theta_K}] \).
What is the analytical solution for linear regression?
The analytical solution allows finding optimal parameters without iterating through epochs by solving a specific equation.
What is the complexity of matrix inversion?
Matrix inversion has cubic complexity, making it computationally expensive for large problems.
What is multiple linear regression?
Multiple linear regression uses multiple input features, each with its own parameter, to predict an output value.
How does the RMSE change with multiple features?
The RMSE (Root Mean Square Error) is typically lower with multiple features due to increased information for prediction.
What is RMSE in model evaluation?
Root Mean Square Error (RMSE) measures the differences between predicted and observed values; lower RMSE indicates better model accuracy.
How does using more features affect model predictions?
Using more features provides more information, leading to more accurate predictions in the model.
What does a linear regression model represent in higher dimensions?
In higher dimensions, the linear regression model is a continuous linear plane representing the learned data.
What is the role of the nucleus in a biological neuron?
The nucleus acts like the neuron's brain, telling it what to do.
What do dendrites do in a biological neuron?
Dendrites connect to other neurons and receive signals from them.
What happens when a biological neuron's axon fires?
When conditions are right, the axon fires a signal to connect with other neurons' dendrites.
What are input features in an artificial neuron?
Input features (xi) are the values fed into the artificial neuron, each with an associated weight (θi).
What determines the importance of a feature in an artificial neuron?
The weight (θi) associated with each input feature determines its importance in the artificial neuron.
What does the output of an artificial neuron involve?
The output involves multiplying features and weights, and adding the bias (b).
What is the activation function in an artificial neuron?
The activation function (g) transforms the output of the linear equation into a new value.
How can the bias term be included in the equation?
The bias term can be included by reformulating the equation to add an extra feature and weight for the bias.
What is the vector notation for input features and weights?
Input features and weights can be represented as vectors: x = [x1, x2, ..., xK], W = [θ1, θ2, ..., θK].
What is the logistic activation function used for?
The logistic function (sigmoid) squashes any value into a range between 0 and 1.
What does logistic regression actually do?
Logistic regression performs binary classification using the logistic function, not actual regression.
How is the logistic regression model optimized?
The logistic regression model is optimized using gradient descent.
What is a perceptron?
A perceptron is an algorithm for supervised binary classification, an early version of an artificial neuron.
What activation function does a perceptron use?
A perceptron uses a threshold function as its activation function, outputting 0 until a certain limit is reached.
What does the perceptron use as its activation function?
A threshold function that outputs 0 until a limit (θ) is reached, then outputs 1.
What happens when y = 1 and h(x) = 0?
Weight θi is increased if xi is positive, decreased if negative.
What happens when y = 0 and h(x) = 1?
We want to decrease the summation, so we do the opposite to reduce WT x.
Why can't a perceptron learn XOR?
XOR is not linearly separable; one linear line cannot separate the classes.
What is a multi-layer perceptron (MLP)?
A network that connects neurons in sequence to learn higher order features.
What is the role of hidden layers in a neural network?
They process features and are not visible from the outside.
What is the first and last layer of a neural network called?
The first layer is the input layer and the last is the output layer.
What should you check when something isn’t working in a neural network?
Ensure that the matrix dimensions match.
What is b in the context of a neural network layer?
The layer-specific bias vector, unique to each neuron in a layer.
What was the approach to feature crafting before multi-layer networks?
Manually crafting features for pattern recognition.
What is the benefit of training both feature extraction and classification layers together?
They optimize each other based on data.
What happens if we only use linear activation functions in a multi-layer network?
It becomes equivalent to a single-layer network.
What happens when a two-layer network uses linear activation?
It collapses into a single-layer network, unable to capture complex non-linear patterns.
What do non-linear activation functions do?
They allow models to learn complicated patterns by breaking the dependency of multiple layers collapsing into one.
What is the range of the sigmoid activation function?
The sigmoid function compresses output into the range between 0 and 1.
What is the range of the tanh activation function?
The tanh function maps input values to the range -1 to 1.
What is the formula for the tanh activation function?
f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
What characterizes the ReLU activation function?
ReLU is linear and unbounded in the positive part, but non-linear overall.
What does the softmax activation function do?
It scales inputs into a probability distribution that sums to 1.
What is a common activation function for deep neural networks?
ReLU is commonly used in very deep neural networks, especially for image recognition.
What is a potential issue with using ReLU?
ReLU can produce unbounded activations, which can destabilise training in the network.
What should you try first when designing models?
Experiment with tanh and sigmoid first, as they are bounded.
How should the choice of activation function in hidden layers be treated?
It is a hyperparameter that can be set empirically or optimized using a development set.
How can we set hyperparameters for activation functions?
Empirically or using a development set to find the best performing function for the model and dataset.
What activation function is commonly used for binary classification?
Sigmoid is most common; tanh can also be used.
What activation function should be used for predicting unbounded scores?
Use a linear activation function.
What activation function is most commonly used for predicting a probability distribution?
Softmax is used for multi-class classification.
How many neurons are in the hidden layer of the PyTorch neural network?
There are 5 neurons in the hidden layer.
What activation function is applied in the hidden layer during the forward pass?
Tanh is used as the activation function.
What is the purpose of the loss function in neural networks?
To minimize and show performance on a specific task.
How do we update parameters in neural networks?
Using gradient descent to minimize the loss function.
What is the formula for updating parameters in gradient descent?
\( \theta_i^{(t+1)} = \theta_i^{(t)} - \alpha \frac{\partial E}{\partial \theta_i^{(t)}} \)
What is the formula for Mean Squared Error (MSE)?
\( MSE = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 \)
What is the primary goal of classification tasks?
To choose between different categories or discrete options.
What is multi-class classification?
Classification with more than 1 class, where each input belongs to exactly 1 class.
What do we want to maximize in classification?
The likelihood of the network assigning correct labels.
What is the probability formulation for binary classification?
\( \prod_{i=1}^{N} (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{(1-y^{(i)})} \)
What happens if the network assigns the correct label for every data point?
The product approaches 1.
What is the issue with multiplying probabilities in classification?
It can lead to underflow errors.
How can we avoid underflow errors in classification?
By maximizing the logarithm of the probability formula.
What is the formula for binary cross-entropy loss?
\( -\sum_{i=1}^{N} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})] \)
What is the normalised form of binary cross-entropy loss?
\( L = -\frac{1}{N} \sum_{i=1}^{N} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})] \)
What does normalizing by the number of data points do in loss calculation?
It makes the loss magnitude independent of the number of data points.
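A small sketch of the normalised binary cross-entropy (the toy labels and predictions are made up; predictions must lie strictly in (0, 1) so the logarithms are defined):

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Mean binary cross-entropy over N predictions in (0, 1)."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

loss = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
```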
What is the formula for categorical cross-entropy loss?
\( L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_c(i) \log(\hat{y}_c(i)) \)
In categorical cross-entropy, what does y_c represent?
1 if C is the correct class for data point i, 0 otherwise.
What is the output layer configuration for a multi-class classification neural network example?
An output layer with 3 neurons predicting probabilities over 3 flower types.
What is batching in neural networks?
Combining vectors of several data points into a matrix for simultaneous processing.
Why is batching beneficial for training neural networks?
It increases speed and reduces noise, leveraging GPU efficiency.
How does batching assist in regularization during optimization?
It combines updates from several data points easily.
What is the benefit of batching in neural networks?
Combines updates from several datapoints, making updates more stable and accurate.
What is the input matrix X in a neural network?
A batch of data points with dimensions n × k, where n is the number of data points and k is the number of features.
What does the first layer in a neural network apply?
A linear transformation using a weight matrix and adding a bias.
What is Z in the context of a neural network?
The output matrix after applying the weight matrix and bias in the first layer.
What do we get after applying the activation function to Z?
A, the output of the first hidden layer.
What is the purpose of calculating loss in a neural network?
To determine how well the model performs.
What method is used to update model parameters in neural networks?
Gradient descent is used to update weight matrices and biases.
What is backpropagation in neural networks?
A method to calculate necessary partial derivatives iteratively.
How does backpropagation simplify calculations?
It breaks down calculations into smaller steps, moving backwards through the network.
What is the chain rule used for in neural networks?
To calculate the derivative of a composite function.
What does the chain rule formula represent?
It shows how to break down derivatives into smaller parts for easier calculation.
How can we find the partial derivative of the loss with respect to W[1]?
By breaking it down through Z[1] and A[1] using their respective derivatives.
What are the two types of partial derivatives in backpropagation?
The output of an activation function w.r.t its input and the output of a linear transformation w.r.t its input.
What is the purpose of the partial derivative in backpropagation?
To update the weights of the linear transformation in the neural network.
What does the partial derivative of a matrix w.r.t another matrix represent?
A 4-D tensor containing the partial derivatives of every element in the first matrix w.r.t every element in the second.
What is the linear transformation notation used in backpropagation?
Z = XW, where Z is the output, X is input, and W is weights.
What do you need to calculate to update weights in a linear transformation?
The partial derivative of the loss w.r.t the weights and the bias vector.
What is the shape of the partial derivative of a scalar w.r.t a matrix?
It has the same shape as the original matrix itself.
What is the key component in the derivatives during backpropagation?
The partial derivative of the loss w.r.t the output of the linear transformation.
What does backpropagation iteratively calculate?
Partial derivatives, taking them from the top layers and passing them down.
What is the bias vector used for in backpropagation?
It is broadcast (repeated) across the batch so that the same bias is added to every data point's input vector.
What is necessary for lower levels to calculate their own partial derivatives?
The gradient of the loss w.r.t the input and the weight's partial derivative.
What is the significance of the dimensions N, D, and M in backpropagation?
N is the number of data points in the batch, D the input dimensionality, and M the number of output neurons.
What happens during the forward pass in a neural network?
The operation takes X and W as inputs and produces output Z.
What does the partial derivative of the loss w.r.t one element depend on?
It depends on the weights it multiplies with and the loss of whatever uses this element.
How many output values does the particular element affect?
It affects exactly 3 output values: z1,1, z1,2, and z1,3.
What is the equation for the partial derivative of the element?
The equation uses the chain rule and involves the weight w1,1 and the partial derivative of z1,1 w.r.t x1,1.
What happens when you calculate the partial derivative w.r.t the full matrix X?
It can be expressed as a dot product of two matrices.
What do the two matrices in the dot product represent?
The first is the partial derivative of the loss w.r.t Z, and the second is the transposed weight matrix for the layer.
What is the importance of backpropagation for inputs X?
It is a simple way of calculating backpropagation for inputs in a given layer.
How do we calculate the partial derivative w.r.t the weights?
By breaking it down for one individual weight, considering its effect on the output.
What does one weight affect in the output?
One weight affects two values in the output for two data points in the batch.
What is the equation for the partial derivative of the loss w.r.t the weights?
It is the dot product of the transposed feature matrix \( X^T \) and the partial derivative of the loss w.r.t. Z: \( \frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Z} \).
What do we need to calculate for the bias vector?
The partial derivative of the loss w.r.t the bias vector.
What result do we get for the partial derivative of the loss w.r.t the bias?
It is equal to a transposed column vector of 1s times \( \frac{\partial L}{\partial Z} \), i.e. the column-wise sum of \( \frac{\partial L}{\partial Z} \) over the batch.
What is needed to perform full backpropagation through the neural network?
How to handle the activation functions.
What is the purpose of activation functions in a neural network?
Activation functions are applied element-wise to introduce non-linearity, allowing the network to learn complex patterns.
Do activation functions have parameters that need updating during training?
No, activation functions generally do not have parameters that need to be updated during training.
What is the derivative of an activation function denoted as?
The derivative of an activation function is denoted as g′(x).
What does the chain rule help with in backpropagation?
The chain rule helps calculate the partial derivative of the loss with respect to the inputs of the activation function.
What is the formula for the Sigmoid activation function?
For Sigmoid: g(z) = 1/(1 + e^(-z)), g′(z) = g(z)(1 - g(z)).
What is the formula for the Tanh activation function?
For Tanh: g(z) = (e^z - e^(-z))/(e^z + e^(-z)), g′(z) = 1 - g(z)².
What is the ReLU activation function and its derivative?
For ReLU: g(z) = z for z > 0, 0 for z ≤ 0; g′(z) = 1 for z > 0, 0 for z ≤ 0.
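The three activation functions and their derivatives from the cards above can be written out and numerically cross-checked with a two-sided difference:

```python
import math

def sigmoid(z):   return 1 / (1 + math.exp(-z))
def d_sigmoid(z): return sigmoid(z) * (1 - sigmoid(z))

def tanh(z):      return math.tanh(z)
def d_tanh(z):    return 1 - math.tanh(z) ** 2

def relu(z):      return z if z > 0 else 0.0
def d_relu(z):    return 1.0 if z > 0 else 0.0

# Two-sided numerical check that each derivative matches its function:
eps = 1e-6
for g, dg in [(sigmoid, d_sigmoid), (tanh, d_tanh), (relu, d_relu)]:
    numeric = (g(0.5 + eps) - g(0.5 - eps)) / (2 * eps)
    assert abs(numeric - dg(0.5)) < 1e-6
```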
How is Softmax different from other activation functions?
Softmax takes a whole vector as input and outputs a whole vector, unlike other activation functions applied element-wise.
What is the purpose of combining Softmax with cross-entropy?
Combining Softmax with cross-entropy simplifies the backpropagation of derivatives for classification tasks.
What does the joint partial derivative through Softmax and cross-entropy represent?
It represents the predictions minus the true class labels, normalized by N if applicable.
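The softmax plus cross-entropy gradient \( \hat{y} - y \) can be sketched as follows (the max-subtraction trick for numerical stability is a standard implementation detail, not from the cards; example logits are made up):

```python
import math

def softmax(z):
    """Scale a vector of scores into a probability distribution summing to 1."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_xent_grad(logits, one_hot):
    """Joint gradient of cross-entropy through softmax w.r.t. the logits."""
    y_hat = softmax(logits)
    return [p - t for p, t in zip(y_hat, one_hot)]  # predictions minus labels

probs = softmax([2.0, 1.0, 0.1])
grad = softmax_xent_grad([2.0, 1.0, 0.1], [1, 0, 0])
```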
What is gradient descent?
Gradient descent is an optimization algorithm that updates parameters by taking small steps in the negative direction of the gradient.
What is the formula for updating weights in gradient descent?
W_new = W_old - α * (∂L/∂W), where α is the learning rate.
What is the learning rate in gradient descent?
The learning rate (α) is a hyperparameter that determines the step size for updating model parameters.
What does α represent in gradient descent?
Learning rate/step size, a hyperparameter based on the development set.
What must be true for gradients to be computed in neural networks?
Network functions and the loss need to be differentiable.
What is the termination condition in gradient descent?
When the loss function does not improve anymore.
What is a common issue when updating weights during backpropagation?
Updating weights before all gradients have been computed with the original weights can cause errors.
What is Stochastic Gradient Descent (SGD)?
Calculating the gradient based on one data point and updating weights immediately.
What are the steps in Stochastic Gradient Descent?
Shuffle the data; for each data point, compute the gradient of the loss for that point and update the weights immediately; repeat for several epochs.
What is Mini-batched Gradient Descent?
A balance between batch and stochastic gradient descent, using batches of data points.
What are the steps in Mini-batched Gradient Descent?
Split the data into small batches; for each batch, compute the gradient of the loss over the batch and update the weights; repeat for several epochs.
What is a challenge in optimising neural networks?
Finding the lowest point on complex loss surfaces is difficult.
Why is the learning rate important?
The size of the learning rate significantly affects the training process.
What happens if the learning rate is too low?
Optimization can take a very long time to reach a good minimum.
What is the ideal state of the learning rate?
It allows reaching the minimum of the loss function in a reasonable number of steps.
What happens if a parameter has not been updated for a while?
The learning rate for that parameter may be increased.
What happens if a parameter is making big updates?
The learning rate for that parameter may be decreased.
What is the intuition behind learning rate decay?
Take smaller steps as we approach the minimum to avoid overshooting.
When can learning rate decay be performed?
Every epoch, after a certain number of epochs, or when validation performance doesn't improve.
Why should we not set all weights to zero?
Neurons will learn the same things, leading to the same optimized values.
What is a common method for weight initialization?
Drawing randomly from a normal distribution with mean 0 and variance 1 or 0.1.
What does Xavier Glorot initialization do?
Draws values from a uniform distribution based on the number of neurons in layers.
What is the formula used in Xavier Glorot initialization?
Weights are drawn from a uniform distribution on \( [-\frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}] \), where \( n_{in} \) and \( n_{out} \) are the numbers of neurons in the previous and current layer.
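A sketch of Glorot uniform initialisation, assuming the standard limit \( \sqrt{6 / (n_{in} + n_{out})} \) (function name and seed are illustrative):

```python
import random

def glorot_uniform(n_in, n_out, seed=0):
    """Draw an n_in x n_out weight matrix from U(-limit, limit),
    with limit = sqrt(6 / (n_in + n_out))."""
    rng = random.Random(seed)
    limit = (6 / (n_in + n_out)) ** 0.5
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

W = glorot_uniform(4, 5)  # weights for a layer mapping 4 inputs to 5 neurons
```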
What role does randomness play in neural networks?
Weight initialisation, data shuffling and similar steps are random, so different random initialisations lead to different results and performance.
What is the solution to controlling randomness in neural networks?
Explicitly set the random seed for all random number generators used.
What can happen when processes are parallelised on GPUs?
They can produce randomly different results due to different threads running at different times.
How should you report model performance under different random seeds?
Report the mean and standard deviation of the performance.
What is min-max normalisation?
Scaling the smallest value to a and the largest to b, e.g., [0, 1] or [-1, 1].
What is standardisation (z-normalisation)?
Scaling the data to have mean 0 and standard deviation 1.
Why is normalisation important in neural networks?
It helps weight updates to be proportional to the input, improving model learning accuracy.
What should you remember about normalisation for data columns?
Normalise each column separately, not the entire matrix.
How should normalising constants be calculated?
Calculate them based only on the training set and apply them to test/evaluation sets.
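A sketch of per-column z-normalisation where the constants are fit on the training data only and then applied to any split (function names are illustrative):

```python
def zscore_fit(column):
    """Compute mean and standard deviation from the training column only."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    return mean, var ** 0.5

def zscore_apply(column, mean, std):
    """Apply the training-set constants to any split (train, val or test)."""
    return [(x - mean) / std for x in column]

train_col = [1.0, 2.0, 3.0, 4.0]
mean, std = zscore_fit(train_col)
train_norm = zscore_apply(train_col, mean, std)
test_norm = zscore_apply([5.0], mean, std)  # test uses the SAME constants
```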
What is gradient checking?
A method to verify if the gradient is calculated correctly in the implementation.
What are the two methods to isolate the gradient?
Calculating it analytically with backpropagation, and estimating it numerically by perturbing each parameter slightly (finite differences).
What is the definition of a partial derivative?
The partial derivative of L(w) with respect to w is defined as: \( \frac{\partial L(w)}{\partial w} = \lim_{\epsilon \to 0} \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon} \)
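Gradient checking with this two-sided definition is a few lines; the example loss \( L(w) = w^2 \) is made up for illustration:

```python
def numerical_gradient(loss, w, eps=1e-5):
    """Two-sided numerical estimate of dL/dw, for comparison with backprop."""
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

# Example: L(w) = w^2 has analytical gradient 2w.
analytic = 2 * 3.0
numeric = numerical_gradient(lambda w: w * w, 3.0)
# If the two values are not close, suspect a bug in the gradient code.
```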
What indicates a bug in neural network training?
If the values from different methods of calculating partial derivatives are not similar, it indicates a bug.
What is overfitting in neural networks?
Overfitting occurs when a model learns the training data too well, failing to generalize to unseen data.
How can overfitting be prevented?
To prevent overfitting, use held-out validation and test sets to measure generalization performance.
What is network capacity?
Network capacity refers to the number of parameters in a model and its ability to overfit the dataset.
What does it mean if a model is underfitting?
Underfitting means the model performs poorly on both training and validation sets due to insufficient capacity.
How can you improve a model that is underfitting?
Increase the number of neurons, parameters, or layers in the model to improve learning.
What indicates a model is overfitting?
Overfitting is indicated by good performance on the training set but poor performance on the validation set.
What is one method to prevent overfitting?
Limit the number of parameters in the model to prevent memorization of the dataset.
What is the best solution to overfitting?
The best solution to overfitting is to acquire more data for training.
What is early stopping in neural network training?
Early stopping is a method where training is halted when performance on the validation set does not improve for a set number of epochs.
What is regularization in the context of neural networks?
Regularization adds constraints to the model to prevent overfitting, such as penalizing large weights.
What are L2 and L1 regularization?
L2 regularization adds squared weights to the loss function, while L1 regularization adds absolute weights, both helping to control model complexity.
What does L2 regularization do to weights?
L2 regularization penalizes larger weights more, encouraging sharing between features and pushing weights towards 0.
What does L2 regularisation do?
Adds squared weights to the loss function, penalising larger weights more and encouraging sharing between features.
What is the role of the hyperparameter λ in L2 regularisation?
Controls the importance of regularisation, usually set to a low value (e.g., 0.001).
What does L1 regularisation do?
Adds the absolute value of weights to the loss function, using the sign of the weight for updates.
How does L1 regularisation affect weight updates?
The update rule is: \( w \leftarrow w - \alpha (\frac{\partial Loss}{\partial w} + \lambda \, sign(w)) \)
How do L1 and L2 regularisation differ in weight management?
L2 pushes all weights towards 0, while L1 encourages sparsity, keeping many weights at 0.
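The L1/L2 updates in the cards above can be sketched as a single gradient step. A minimal NumPy illustration (the learning rate, λ, and weight values are arbitrary choices for the example):

```python
import numpy as np

def regularized_step(w, grad, alpha=0.1, lam=0.01, kind="l2"):
    """One gradient step with an L1 or L2 penalty on the weights.

    w, grad: weight vector and loss gradient; alpha: learning rate;
    lam: regularisation strength (lambda).
    """
    if kind == "l2":
        # L2 adds lambda * w to the gradient, shrinking all weights towards 0.
        return w - alpha * (grad + lam * w)
    # L1 adds lambda * sign(w), pushing small weights exactly to 0 (sparsity).
    return w - alpha * (grad + lam * np.sign(w))

w = np.array([1.0, -2.0, 0.5])
g = np.zeros(3)  # a zero loss gradient isolates the penalty's effect
w_l2 = regularized_step(w, g, kind="l2")
w_l1 = regularized_step(w, g, kind="l1")
```

With a zero loss gradient, the L2 step shrinks every weight proportionally, while the L1 step moves each weight by the same fixed amount towards 0, which is why L1 tends to produce exact zeros (sparsity).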
What is dropout in neural networks?
A method to reduce overfitting by randomly setting some neural activations to 0 during training.
What percentage of neurons are typically dropped during training with dropout?
About 50% of neurons are typically dropped at each forward pass during training.
What happens during testing when using dropout?
All neurons are used, but inputs are scaled to match training expectations.
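The dropout behaviour described above — drop at training time, scale at test time — can be sketched as follows (the drop rate and activation values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.5, training=True):
    """Classic dropout: during training, zero each activation with
    probability p; at test time, keep every neuron but scale by (1 - p)
    so the expected input to the next layer matches training."""
    if training:
        mask = rng.random(a.shape) >= p  # keep each unit with probability 1-p
        return a * mask
    return a * (1.0 - p)

a = np.ones(1000)                      # toy activations
train_out = dropout(a, training=True)  # roughly half the units zeroed
test_out = dropout(a, training=False)  # all units kept, scaled by 0.5
```

Modern frameworks usually implement "inverted dropout" instead, scaling by 1/(1 − p) during training so the test-time forward pass needs no change; the expected activations are the same either way.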
What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data, while unsupervised learning uses only feature values without labels.
What is the objective of unsupervised learning?
To find hidden structures in the dataset without ground-truth labels.
What is unsupervised learning?
A type of learning where the dataset consists only of feature values without ground-truth labels.
What is the objective of unsupervised learning?
To find hidden structures in the dataset for making inferences or decisions.
What is clustering in unsupervised learning?
The task of finding groups ('clusters') of samples that might belong to the same class.
What is density estimation?
Finding the probability of seeing a point in a certain location compared to another location.
What is dimensionality reduction?
A process to reduce the number of features while retaining important information.
What does clustering imply about intra-cluster variance?
There is low intra-cluster variance among instances in the same cluster.
What are the steps of the k-means algorithm?
Initialisation, Assignment, Update, and checking for convergence.
What is a cluster in clustering?
A set of instances that are similar to each other and dissimilar to instances in other clusters.
How does clustering help in vector quantization?
It improves encoding by clustering information in a datastream to reduce data size.
What is an example of using clustering in nature?
Identifying different species of flowers by plotting features like petal length vs. sepal width.
What is the structure of an unsupervised learning task?
A feature space with datapoints lacking additional information like labels or values.
What does k represent in k-means clustering?
The number of clusters, e.g., k = 3 means there are 3 centroids.
What is the first step in the k-means algorithm?
Initialisation: Select k random instances or generate random vectors for centroids.
What is the goal of the assignment step in k-means?
Assign every point in the dataset to the nearest centroid.
How do we update centroids in k-means?
By computing the average position of all points in each cluster.
What is checked during the convergence step in k-means?
The displacement of centroids; if it's larger than a threshold, loop back to assignment.
What is the formula for the assignment step in k-means?
\( \forall i \in \{1, \dots, N\}: \; c(i) = \text{argmin}_{k \in \{1, \dots, K\}} \| x^{(i)} - \boldsymbol{\mu}_k \|^2 \)
What does the update formula in k-means compute?
The average location for all samples assigned to cluster k.
What condition indicates convergence in k-means?
If \( \forall k: \; \| \boldsymbol{\mu}_k^t - \boldsymbol{\mu}_k^{t-1} \| < \epsilon \).
What is checked in Step 4 of K-means?
Convergence by computing the movement of centroids between timesteps.
What indicates to stop iterating in K-means?
If the movement of centroids is lower than a certain threshold (𝜖).
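The four k-means steps in these cards (initialisation, assignment, update, convergence check) can be sketched in a few lines of NumPy; the two well-separated blobs are a hypothetical dataset for the example:

```python
import numpy as np

def kmeans(X, k, eps=1e-6, seed=0):
    """Minimal k-means: initialise, assign, update, repeat until the
    centroids move less than eps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # init: k random instances
    while True:
        # Assignment: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points.
        new_mu = np.array([X[c == j].mean(axis=0) for j in range(k)])
        # Convergence: stop once every centroid's displacement is below eps.
        if np.all(np.linalg.norm(new_mu - mu, axis=1) < eps):
            return new_mu, c
        mu = new_mu

# Hypothetical data: two well-separated blobs around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
mu, c = kmeans(X, k=2)
```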
How is K-means viewed as a model?
As a model optimization problem with centroid locations and data point assignments.
What is the objective of K-means?
Minimize the loss function L for assignments of data points to centroids.
What does the loss function L represent?
The mean distance between samples and their associated centroid.
What is the significance of K in K-means?
K is a crucial hyperparameter that affects the clustering results.
What is the Elbow Method used for?
To determine the optimal value of K by plotting loss values against K.
What should be selected according to the Elbow Method?
The value of K where the rate of decrease in loss sharply shifts.
What does cross-validation help determine?
The best value for hyperparameters using a validation set.
What is a significant weakness of K-means?
The need to define K, which significantly impacts results.
What is a weakness of K-means regarding its results?
It only finds a local optimum and is sensitive to initial centroid positions.
When is K-means applicable?
When a distance function exists on the dataset, typically with real values.
How does K-medioid algorithm differ from K-means?
It is less sensitive to outliers by using the geometric median.
What shape must clusters have for K-means to work effectively?
Clusters must be hyper-ellipsoids (or hyper-spheres).
What is the objective of density estimation algorithms?
To estimate the probability density function p(x) from data.
What does a Probability Density Function (PDF) model?
The likelihood of a continuous variable being observed within an interval.
What is the goal of generative models in relation to probability?
To model the distribution of a class as p(X | y).
What do discriminative models directly model?
The probability of observing label y given sample values X, p(y | X).
What activation function transforms neural network output into a probability distribution?
Softmax activation.
What does the Softmax activation do?
Transforms the output of the neural network into a probability distribution.
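A minimal softmax sketch (the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    """Map raw network outputs (logits) to a probability distribution;
    subtracting the max first is a standard numerical-stability trick."""
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))  # sums to 1, largest logit wins
```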
What is Bayes’ rule used for in generative models?
To turn the generative model into a discriminative classifier.
What do non-parametric methods assume about function shape?
They make no assumptions about the form/shape of the function.
What is the bias and variance characteristic of non-parametric methods?
Low bias; high variance depending on the data.
What do histograms do in density estimation?
Group data into bins, count occurrences, and normalize.
What does normalization ensure in histograms?
The integral of the function sums to 1, making it a valid PDF.
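The histogram steps above — bin, count, normalise so the integral is 1 — can be sketched as follows (the sample size and bin count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 10_000)  # hypothetical 1-D dataset

# Group the data into bins and count occurrences.
counts, edges = np.histogram(data, bins=50)
widths = np.diff(edges)

# Normalise so the estimate integrates to 1, making it a valid PDF.
density = counts / (counts.sum() * widths)
integral = float(np.sum(density * widths))
```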
What is Kernel Density Estimation?
Estimates the density of a function by using a kernel around training examples.
What does the kernel function do in density estimation?
Computes the difference with the current point x and normalizes according to bandwidth.
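A minimal kernel density estimation sketch with a Gaussian kernel (the bandwidth h = 0.3 and the N(0, 1) sample are illustrative choices):

```python
import numpy as np

def kde(x, samples, h=0.3):
    """Gaussian kernel density estimate at x: centre a kernel on every
    training sample, compute the difference with x, and normalise by the
    bandwidth h."""
    u = (x - samples) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return float(k.mean() / h)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 5000)  # hypothetical training set from N(0, 1)
p_mode = kde(0.0, samples)  # near the true density 1/sqrt(2*pi) ≈ 0.399
p_tail = kde(4.0, samples)  # far in the tail, close to 0
```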
What are the characteristics of parametric approaches?
Make assumptions about the shape, inducing bias but fixing the number of parameters.
What is ensured by the normalization factor in Gaussian distribution?
The integral of the distribution sums to 1.
What is the purpose of the normalization term in the Multivariate Gaussian Distribution?
To ensure the double-integral sums to 1.
What does likelihood determine in a model?
How good the model is at capturing the probability of generating data x.
What do we multiply to get the likelihood in a dataset?
The predicted values from the models for every sample with parameters θ.
Why do we calculate negative log-likelihood instead of likelihood?
To turn maximization into minimization, similar to training a neural network.
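The negative log-likelihood idea can be sketched for a Gaussian model (the data and parameter values are illustrative):

```python
import numpy as np

def gaussian_nll(data, mu, sigma):
    """Negative log-likelihood of data under N(mu, sigma^2). Summing log
    densities (rather than multiplying densities) avoids numerical
    underflow, and negating turns maximisation into minimisation."""
    log_p = -0.5 * np.log(2 * np.pi * sigma ** 2) - (data - mu) ** 2 / (2 * sigma ** 2)
    return float(-log_p.sum())

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, 1000)
nll_good = gaussian_nll(data, mu=2.0, sigma=1.0)  # parameters close to the truth
nll_bad = gaussian_nll(data, mu=0.0, sigma=1.0)   # wrong mean
```

Lower NLL means a better fit, so the parameters closest to those that generated the data win.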
Is the Gaussian distribution sufficient for modeling densities in all cases?
No, it may not be satisfactory for all data distributions.
What is the problem with fitting a Gaussian distribution to bimodal data?
It induces bias and may not capture the data's characteristics.
What is a potential solution to the limitations of Gaussian distributions?
Using mixture models to capture different modes of the distribution.
What does the Gaussian Mixture Model (GMM) estimate?
The probability density with p(x) from multiple Gaussian distributions.
What does GMM ensure about the PDF?
The GMM ensures that the PDF integrates to 1, even if it is a mixture of multiple PDFs.
What algorithm is used to fit GMM to training examples?
The Expectation Maximisation (EM) algorithm is used.
What are the two main steps of the EM algorithm?
The two main steps are the E-step (expectation) and the M-step (maximisation).
What is done in the E-step of the EM algorithm?
Responsibilities for each training example and each mixture component are computed.
How is the responsibility calculated in the E-step?
Using the formula: \( r_{ik} = \frac{\pi_k \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x^{(i)} | \mu_j, \Sigma_j)} \)
What is updated in the M-step of the EM algorithm?
GMM parameters are updated using the computed responsibilities.
How is the mean updated in the M-step?
The mean is updated using: \( \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} r_{ik} x^{(i)} \)
What is checked for convergence in the EM algorithm?
Convergence is checked by monitoring changes in parameters or log likelihood.
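The E-step/M-step loop in these cards can be sketched for a 1-D, two-component GMM. Initialising the means from the data extremes is a simplification for the example; as noted later in this deck, real implementations often initialise from K-means centroids:

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=100):
    """Minimal EM for a 1-D Gaussian mixture model."""
    pi = np.full(k, 1.0 / k)                 # mixing proportions
    mu = np.linspace(x.min(), x.max(), k)    # spread initial means over the data
    var = np.full(k, x.var())                # shared initial variance
    for _ in range(iters):
        # E-step: responsibility r_ik of component k for sample i.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return pi, mu, var

# Hypothetical bimodal data: two modes at -4 and +4.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 500), rng.normal(4, 1, 500)])
pi, mu, var = em_gmm_1d(x)
```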
What is the Bayesian Information Criterion (BIC)?
BIC is used to select the number of components K in GMM.
What does \( \mathcal{L}(K) \) represent in the BIC formula?
\( \mathcal{L}(K) \) is the negative log likelihood.
What is the penalty term in the BIC formula?
The penalty term is \( \frac{P_k}{2} \log(N) \), which penalizes complex models.
How many parameters are needed for covariance in 2D Gaussian?
3 parameters for the covariance (symmetric 2x2 matrix).
What is the purpose of the -1 in the parameter count?
It accounts for the constraint that the sum of mixing proportions must equal 1.
What happens to BIC values as K increases?
BIC values decrease sharply then rise again due to penalty dominance.
What are the steps in cross-validation for GMM-EM?
Split the data into training and validation sets, fit a GMM with EM for each candidate K on the training set, and select the K with the best validation log-likelihood.
What is a key similarity between GMM and K-means?
Both require selecting the most appropriate K value for clusters/components.
What does convergence mean in GMM and K-means?
Convergence occurs when changes in parameters are sufficiently small.
How does GMM initialization relate to K-means?
GMM means are often initialized from K-means centroid locations.
What is soft clustering in GMM?
Every point belongs to several clusters with varying degrees of membership.
What distance metric is used in GMM?
Distance is related to Mahalanobis distance, encoded by the covariance matrix.
What is a Genetic/Evolutionary Algorithm?
An optimisation method for black box functions without knowing the mathematical equation or gradient, inspired by natural evolution and genetics.
What is reinforcement learning?
Learning to maximize a numerical reward, considered an optimization problem.
What do policy search algorithms deal with?
Continuous search spaces, represented as \( x^* = \text{argmax}_x f(x) \).
What are Black-Box Optimisation Algorithms?
Algorithms where the links between parameters are unknown at the start of training.
What is an example of black-box optimisation in robotics?
The unknown relationship between speed and joint movements.
What are the four main concepts of Darwin's theory?
Variation, heredity, selection, and time (evolution over many generations).
What are the three main families of genetic algorithms proposed in the 60s?
Genetic algorithms (Holland), evolution strategies (Rechenberg and Schwefel), and evolutionary programming (Fogel).
What is the main concept of genetic/evolutionary algorithms?
They have a population of solutions encoding genotypes, which are developed into phenotypes for evaluation.
What happens to the worst-performing solutions in genetic algorithms?
They are removed (killed), and crossover and mutation are applied to the better-performing solutions.
What is done to better-performing solutions?
Crossover and mutation are applied to generate new solutions (offspring).
What is the result of repeating the evolutionary process?
The solution converges to an optimal high-performing solution.
How is each solution represented in evolutionary algorithms?
Each solution is represented by a genotype.
What term is used to describe the blurred lines between algorithm families?
Evolutionary Algorithms.
What does p1 represent in the Mastermind fitness function?
The number of pieces with the right color and correct position.
What does p2 represent in the Mastermind fitness function?
The number of pieces with the right color but wrong position.
What is the goal of evolutionary algorithms regarding fitness functions?
Maximize the fitness function.
What is the fitness function for teaching a robot to walk?
F(x) = walking speed = travelled distance after a few seconds.
What is the fitness function for teaching a robot to throw an object?
F(x) = −distance(object, target), so that minimising the distance to the target maximises the fitness.
How is the phenotype created from the genotype in the Mastermind game?
Aggregate bits 3 by 3, each trio becomes an integer.
What do integers correspond to in the Mastermind game?
Different colours: (0=red, 1=yellow, 2=green, 3=blue…).
What is done with invalid genotypes in the Mastermind game?
Assigned the lowest fitness value to reduce survival chance.
What is the purpose of selection operators in evolutionary algorithms?
Select parents for the next generation.
How does the biased roulette wheel process work?
Individuals are selected based on their fitness proportion.
What is the first step in the biased roulette wheel process?
Compute the probability pi to select an individual.
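The biased roulette wheel can be sketched as sampling with fitness-proportionate probabilities (the fitness values are illustrative):

```python
import numpy as np

def roulette_select(fitness, n, seed=0):
    """Biased roulette wheel: individual i is picked with probability
    p_i = fitness_i / sum(fitness)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(fitness, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=n, p=p)

fit = [1.0, 1.0, 8.0]                 # third individual owns 80% of the wheel
picks = roulette_select(fit, n=10_000)
share = float(np.mean(picks == 2))    # empirical selection frequency
```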
What is elitism in evolutionary algorithms?
Keeping a fraction of the best individuals in the new generation.
How is standard mutation on binary strings performed?
Randomly generate a number for each bit; if lower than probability m, mutate.
What is the first step in standard mutation on binary strings?
Randomly generate a number between 0 and 1 for each bit of the genotype.
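Standard bit-flip mutation can be sketched as follows (the mutation rate m and genotype length are arbitrary):

```python
import random

def bitflip_mutation(genotype, m=0.05, rng=random.Random(0)):
    """Standard mutation on a binary string: draw a number in [0, 1) for each
    bit and flip that bit when the draw is below the mutation rate m."""
    return [1 - b if rng.random() < m else b for b in genotype]

parent = [0] * 100
child = bitflip_mutation(parent, m=0.05)
flips = sum(child)  # with m = 0.05 we expect roughly 5 flipped bits
```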
What is the purpose of the specific mutation in the Mastermind problem?
To swap groups of 3 bits in the genotype with probability m2.
What is a common stopping criterion for evolutionary algorithms?
When a specific fitness value is reached.
What is another stopping criterion besides reaching a fitness value?
After a pre-defined number of generations/evaluations.
What do we do after evaluating the population in the evolutionary loop?
Select individuals to keep for the next generation.
What is elitism in the context of evolutionary algorithms?
Keeping a few parents in the new population.
What is the main difference between genetic algorithms and evolutionary strategies?
Genotype: genetic algorithms use binary strings, evolutionary strategies use real values.
What does the μ + λ evolutionary strategy represent?
Maintains a steady population of μ + λ individuals.
What is the first step in the μ + λ evolutionary strategy?
Randomly generate a population of (μ + λ) individuals.
What is the selection process in the μ + λ strategy?
Select the μ best individuals from the population as parents.
How many best individuals are selected as parents?
Select the μ best individuals from the population as parents (called x).
What is generated from the parents in the evolutionary strategy?
Generate λ offspring (called y) from the parents.
What is the formula for generating offspring?
For each offspring: \( y_i = x_j + \mathcal{N}(0, \sigma) \), where j is a random parent index in \( \{1, \dots, \mu\} \).
How is the population defined in the evolutionary strategy?
Population = union of parents and offspring: \( \text{population} = \left( \bigcup_{i=1}^{\lambda} y_i \right) \cup \left( \bigcup_{j=1}^{\mu} x_j \right) \).
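The (μ + λ) loop in these cards can be sketched on a toy objective. The sphere function and all hyperparameter values are illustrative, and this version minimises f rather than maximising a fitness:

```python
import numpy as np

def mu_plus_lambda(f, n_dim, mu=5, lam=20, sigma=0.3, gens=200, seed=0):
    """(mu + lambda) evolution strategy minimising f: select the mu best,
    create lam offspring by adding Gaussian noise N(0, sigma) to a random
    parent, then form the next population from parents union offspring."""
    rng = np.random.default_rng(seed)
    pop = rng.normal(0.0, 1.0, (mu + lam, n_dim))
    for _ in range(gens):
        parents = pop[np.argsort([f(x) for x in pop])[:mu]]   # mu best individuals
        noise = rng.normal(0.0, sigma, (lam, n_dim))          # N(0, sigma) mutation
        offspring = parents[rng.integers(0, mu, lam)] + noise
        pop = np.vstack([parents, offspring])                 # parents ∪ offspring
    return min(pop, key=f)

sphere = lambda x: float(np.sum(x ** 2))  # toy objective, minimum at the origin
best = mu_plus_lambda(sphere, n_dim=3)
```

Because the parents survive into the next population, the best solution found so far is never lost (implicit elitism).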
What is the main challenge in evolutionary strategies?
The main challenge comes in fixing the hyperparameter 𝜎.
What happens if 𝜎 is too large?
If 𝜎 is too large, the population moves quickly to the solution but struggles to refine it.
What happens if 𝜎 is too small?
If 𝜎 is too small, the population moves slowly and might be affected by local optima.
How can 𝜎 be adjusted over time?
Change 𝜎's value over time to adapt to the situation by adding sigma into the genotype.
What is the new genotype defined as?
Define another genotype as xj’ = {xj, σj} composed of the initial genotype and sigma value.
What does the learning rate depend on?
The learning rate 𝜏0 is proportional to 1/√𝑛, where n is the number of dimensions of the genotype.
Why is substituting 𝜎 with 𝜏0 beneficial?
The selection of 𝜏0 is less critical than the value of 𝜎, allowing more flexibility in setting it.
What is the goal of taking inspiration from natural evolution?
To find effective solutions for survival and adaptation in environments.
What is the purpose of novelty search?
To use novelty instead of fitness value to drive the search for optimality.
What is the purpose of the novelty archive?
To store all encountered solutions for novelty calculation
What can happen if a feature is ignored in the behavioral descriptor?
Loss of potentially useful information
What is an example of a behavioral descriptor for a robot?
(x, y) coordinates of the robot's final position
How does novelty search differ from traditional evolutionary algorithms?
It uses novelty score instead of fitness for evaluation
What is the goal of Quality-Diversity Optimization?
To learn diverse and high-performing solutions in one process
What is a potential benefit of novelty search for a bipedal robot?
Searching for novel behaviours can discover a more stable and successful walking gait than directly optimising walking distance.
What is the goal of high-dimensional hyperspace exploration?
To find points that lead to the most interesting solutions.
What does the concept of behavioural descriptors help generate?
A collection of high-performing solutions with high diversity and performance.
How many degrees of freedom does the robot in the example have?
12 degrees of freedom (2 in each leg).
What is the behavioural descriptor for the robot's movement?
Proportion of time each leg touches the ground (6 dimensions).
What is the goal of varying the proportions of time each leg spends touching the ground?
To find an optimal solution for walking as fast as possible.
What are the two main focuses of Quality-Diversity (QD) algorithms?
Measuring performance of solutions and distinguishing different types of solutions.
What does the behavioural descriptor characterize in QD algorithms?
It distinguishes different types of solutions.
What does Novelty Search with Local Competition optimize?
Two fitness functions: novelty score and local competition.
What is the concept of Local Competition in QD algorithms?
Comparing new solutions only with similar ones in the same categories.
What does LC(x) represent in Local Competition?
Number of solutions that x outperforms within its k nearest neighbours.
What happens when a better version of a solution is found in the archive?
The worse version is replaced by the better one.
What is the goal of MAP-Elites?
To discretise the behavioural descriptor space in a grid and fill it with the best solutions.
How does MAP-Elites add new solutions?
If the cell is empty, the new solution is added; if occupied, the best fitness solution is kept.
What is the first step in the MAP-Elites process?
Randomly initialise some solutions to place in the grid.
What happens during the mutation operator in MAP-Elites?
Gaussian noise is added to some/all values of the selected solution.
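The MAP-Elites loop described in these cards (random initialisation, uniform selection, Gaussian mutation, grid addition rule) can be sketched on a toy task; the genotype, behavioural descriptor, and fitness below are all hypothetical choices for illustration:

```python
import numpy as np

def map_elites(evals=5000, grid=10, seed=0):
    """Minimal MAP-Elites sketch. Toy task: genotype = 2-D point in [0,1]^2,
    behavioural descriptor = the genotype binned into a grid x grid map,
    fitness = negative distance to the centre (0.5, 0.5)."""
    rng = np.random.default_rng(seed)
    archive = {}  # cell -> (fitness, genotype)

    def try_add(x):
        cell = tuple(np.minimum((x * grid).astype(int), grid - 1))  # discretise BD
        fit = -float(np.linalg.norm(x - 0.5))
        # Empty cell: add; occupied cell: keep the solution with better fitness.
        if cell not in archive or fit > archive[cell][0]:
            archive[cell] = (fit, x)

    for x in rng.random((100, 2)):  # 1. random initialisation
        try_add(x)
    for _ in range(evals):
        keys = list(archive)
        parent = archive[keys[rng.integers(len(keys))]][1]  # 2. uniform selection
        child = np.clip(parent + rng.normal(0.0, 0.1, 2), 0.0, 1.0)  # 3. mutation
        try_add(child)  # 4. addition rule
    return archive

archive = map_elites()
coverage = len(archive) / 100                   # fraction of filled cells
best_fit = max(f for f, _ in archive.values())  # elite nearest the centre
```

`coverage` and `best_fit` correspond to the diversity and quality metrics discussed in the following cards.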
What is a common metric for diversity in MAP-Elites?
Archive size (number of solutions stored in the collection).
What is the trade-off in QD algorithms represented by?
A Pareto-front to define the best variant of the algorithm.
What does the coverage refer to in MAP-Elites?
Number of filled cells, number of individuals, or % of filled cells in the grid.
What is the purpose of local competition in the algorithm?
To explore many different solutions in the entire space.
What is the addition mechanism in MAP-Elites?
It determines how new solutions are added to the grid based on their fitness.
What is a general framework in QD algorithms?
Allows use of different operators to define quality diversity algorithms for specific tasks.
What does the selector do in QD algorithms?
Selects the individual to be mutated and evaluated in the next generation.
What is the simplest selection method used in MAP-Elites?
Uniform random selection over the solutions in the container.
How can solutions be stored in QD?
Discretised grid (like MAP-Elites) or unstructured archive (like Novelty Search).
What is a key feature of the unstructured archive in QD?
Maintains density instead of strict discretisation.
What is the process for using advanced mutations in QD?
Select multiple operators in stochastic selection, then apply cross-over before mutation.
What is the QD algorithm for teaching a robot to walk?
Unstructured archive + random uniform selector.
What is the behavioral descriptor for the walking robot?
X/Y coordinate position of the robot after 3 seconds.
What is the fitness measure for the walking robot?
Angular error at the end of the trajectory w.r.t. an ideal circular trajectory.
What is the QD algorithm for teaching a robot to push a cube?
MAP-Elites (grid + random uniform selector).
What is the behavioral descriptor for the cube-pushing robot?
Final position of the cube, where diversity is desired.
What are genetic algorithms, evolutionary strategies, and evolutionary algorithms based on?
The same basic concepts.