Flashcards in this deck (794)
  • What is the definition of artificial intelligence according to Kurzweil, 1990?

    The art of creating machines that perform functions requiring intelligence

    ai definition
  • What is Computational Intelligence according to Poole et al., 1998?

    The study of the design of intelligent agents

    computational_intelligence definition
  • What does Nilsson, 1998 say about AI?

    AI is concerned with intelligent behaviour in artifacts

    ai definition
  • What is the focus of Charniak and McDermott, 1985 regarding AI?

    The study of mental faculties through computational models

    ai definition
  • What is Winston, 1992's perspective on AI?

    The study of computations that enable perception, reasoning, and acting

    ai definition
  • What is Haugeland, 1986's definition of AI?

    The effort to make computers think like humans

    ai definition
  • What is Bellman, 1978's view on AI?

    The automation of activities associated with human thinking

    ai definition
  • What approach does this course take towards AI?

    The human route, programming computers to act humanly or learn from experience

    course_approach ai
  • What is a key application of ML mentioned?

    Robotics

    applications ml
  • What is another application of ML?

    Self-driving cars

    applications ml
  • What is an example of ML application in medicine?

    Detecting sepsis in MRI scans

    applications ml medicine
  • What is machine learning?

    The field of machine learning is concerned with constructing computer programs that automatically improve with experience.

    machine_learning definition
  • Who defined machine learning in 1997?

    Tom Mitchell.

    machine_learning history
  • What is the focus of machine learning according to Tom Mitchell's 1997 definition?

    A computer program learns from experience E with respect to tasks T and performance measure P if its performance improves with experience.

    machine_learning definition
  • What does the function 'f' calculate in the example?

    Student's grades in Intro to ML.

    machine_learning example
  • What is 'h' in the example?

    An approximation function used to estimate grades based on past data.

    machine_learning example
  • What are the three main categories of machine learning settings?

    Supervised, Unsupervised, and Reinforcement.

    machine_learning categories
  • What does supervised learning do?

    Given labelled examples, produces a model that predicts the correct output labels for new inputs.

    machine_learning supervised
  • What is unsupervised learning?

    No labels are given; algorithms find patterns in the data.

    machine_learning unsupervised
  • What is clustering in unsupervised learning?

    Dividing data into groups based on similarities, like dogs and cats.

    machine_learning clustering
  • What is dimensionality reduction?

    Identifying important features in data, like enhancing a blurry image of a face.

    machine_learning dimensionality_reduction
  • What is reinforcement learning?

    An agent interacts with the environment and uses the reward signal it receives to improve.

    machine_learning reinforcement
  • What is policy search in reinforcement learning?

    Finding actions for an agent to maximize received rewards based on its state.

    machine_learning policy_search
  • What is semi-supervised learning?

    Some data have labels, some do not; aims to label unlabelled data using labelled items.

    machine_learning semi_supervised
  • What is weakly-supervised learning?

    Inexact output labels; e.g., indicating an item is somewhere in an image without precise location.

    machine_learning weakly_supervised
  • What is classification in machine learning?

    Assigning discrete or categorical variables to inputs, like predicting actions in videos.

    machine_learning classification
  • What is binary classification?

    A classification task with only 2 labels to choose from.

    machine_learning binary_classification
  • What is multi-class classification?

    A classification task with more than 2 labels to choose from.

    machine_learning multi_class_classification
  • What is multi-label classification?

    A classification task where multiple labels can be correct for a single input.

    machine_learning multi_label_classification
  • What is regression in machine learning?

    Assigning a real/continuous float value to an input.

    machine_learning regression
  • What is simple regression?

    A regression with 1 input variable and 1 output variable.

    machine_learning simple_regression
  • What is multiple regression?

    A regression with multiple input variables and 1 output variable.

    machine_learning multiple_regression
  • What is simple regression?

    1 input variable and 1 output variable. E.g., size of a house predicts its price.

    regression simple
  • What is multiple regression?

    Multiple input variables and 1 output variable. E.g., grade calculator with 3 inputs and 1 output (grade).

    regression multiple
  • What is multivariate regression?

    Multiple inputs to predict multiple outputs. E.g., predicting the location of an umbrella from a picture.

    regression multivariate
  • What is an example regression problem?

    Given time as input, the regressor predicts the value at that time.

    regression example
  • What characterizes a bad predictor in regression?

    The line is far off from almost all points.

    regression predictor
  • What characterizes a good predictor in regression?

    The line is close to most points, even if it is off.

    regression predictor
  • What characterizes a very good predictor in regression?

    It predicts given points well but may struggle with unknown examples.

    regression predictor
  • What is supervised learning?

    Most common setting in ML problems, typically involves classification and regression.

    machine_learning supervised
  • How does Antoine classify shapes?

    By placing data along 2 axes (colour and points) to create a classifier.

    classification antoine
  • What is a linear classifier?

    A classifier that uses a straight line to separate data into categories.

    classification linear
  • What have we learnt about data in predictions?

    More data leads to more accurate predictions.

    predictions data
  • Why is selecting good features important?

    Good features improve prediction accuracy; combining features is often better.

    features predictions
  • What are two ways to make predictions?

    1. Looking at neighbours 2. Slicing space into partitions with lines.

    predictions methods
  • What is the goal of generating a model in supervised learning?

    To approximate the true function using input data to predict outputs.

    supervised_learning model
  • What is the training dataset defined as?

    A sequence of pairs of input and output labels (Xn and yn).

    dataset training
  • What is feature encoding in supervised learning?

    Transforming raw input observations into a modified version (feature space).

    features encoding
  • What is the purpose of the Xtest dataset?

    To evaluate model performance on unseen data by comparing predicted outputs with ground truth.

    testing evaluation
  • What do we compute to measure model performance?

    A score comparing predicted outputs with the ground truth/gold standard annotation.

    evaluation performance
  • What is the purpose of the truth/gold standard annotation?

    To compute a score measuring model performance.

    annotation model performance
  • What is the first step in the complete pipeline?

    Feature Encoding

    pipeline feature encoding
  • Why is it important to examine data before designing an algorithm?

    It can provide clues for classifier design and help identify class label distribution.

    data algorithm design
  • What happens if class labels are imbalanced?

    The algorithm may learn to identify only the majority class.

    class imbalance algorithm
  • What should you do with features before starting an algorithm?

    Normalize your features.

    features normalization
  • How do you normalize features?

    Subtract the mean and divide by the standard deviation.

    normalization features
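The normalization recipe on this card (subtract the mean, divide by the standard deviation) can be sketched in Python; the `heights` values below are made up for illustration:

```python
import statistics

def normalize(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

heights = [150.0, 160.0, 170.0, 180.0, 190.0]
print(normalize(heights))  # centred on 0, spread of 1
```

After this transform every feature has mean 0 and standard deviation 1, so no feature dominates a distance metric simply because of its units.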
  • What is the curse of dimensionality?

    As dimensions increase, data becomes sparse and training data may be noisy.

    dimensionality sparse data
  • What is feature selection?

    Choosing a subset of original features to work with.

    feature selection
  • What is feature extraction?

    Generating a new set of features from the original features.

    feature extraction
  • What is the Bag of Words method in NLP?

    Logging the frequency of words without tracking their positions.

    nlp bagofwords method
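A minimal Bag of Words sketch: log word frequencies and discard positions, as the card describes (the example sentence is arbitrary):

```python
from collections import Counter

def bag_of_words(text):
    """Count word frequencies; word order/position is deliberately discarded."""
    return Counter(text.lower().split())

print(bag_of_words("the cat sat on the mat"))
```

Note that "the cat sat on the mat" and "the mat sat on the cat" map to the same bag, which is exactly the information this encoding gives up.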
  • What is the modern approach to feature encoding in deep learning?

    Letting the algorithm figure out optimal features from raw data.

    deep learning feature
  • What is a lazy learner?

    Stores training examples and generalizes upon explicit request at test time.

    lazy learning algorithm
  • What is an eager learner?

    Constructs a general description of the target function before test time.

    eager learning algorithm
  • How does an eager learner contrast with a lazy learner?

    Learns and generalises all it can before test time, resulting in quicker test time.

    machine_learning models
  • What is a Non-Parametric Model?

    Assumes no fixed form; trusts the data instead of a function.

    machine_learning non_parametric
  • What is an example of a Non-Parametric Model?

    The nearest neighbour classifier, which is a lazy learner.

    machine_learning nearest_neighbour
  • How does a nearest neighbour classifier work?

    Assigns a test instance the class label of its nearest training instance.

    machine_learning classification
  • What is a Linear Model?

    Assumes the data is linearly separable, learning the best line to separate it.

    machine_learning linear_model
  • What does a Linear Model classify?

    Anything on the left as a green diamond, anything on the right as a red circle.

    machine_learning classification
  • What is a Non-Linear Model?

    Used for non-linearly separable problems with more complex models.

    machine_learning non_linear_model
  • What is Feature Space Transformation?

    Representing data differently to analyze and separate it more easily.

    machine_learning feature_transformation
  • How do SVMs solve non-linear datasets?

    Use a kernel for transformation.

    machine_learning svm
  • How do Neural Networks handle non-linear datasets?

    Try to learn how to transform the feature space automatically.

    machine_learning neural_networks
  • What is the Bias-Variance trade-off?

    A balance between overfitting (high variance) and underfitting (high bias).

    machine_learning bias_variance
  • What is Occam’s razor in ML?

    Choose the simpler model if two models perform similarly.

    machine_learning occams_razor
  • What does MSE stand for?

    Mean Squared Error, measures average square distance between correct and predicted outputs.

    machine_learning mse
  • Is 85% accuracy good?

    Accuracy is relative; depends on baseline and upper bound performance.

    machine_learning accuracy
  • What is the Baseline in performance evaluation?

    The lower bound for performance, often chance/random performance.

    machine_learning baseline
  • What is the Upper bound in performance evaluation?

    The best case, often compared to human performance.

    machine_learning upper_bound
  • What is K-Nearest Neighbours?

    A lazy learner that stores data until a request is made.

    machine_learning knn
  • What are Decision Trees in ML?

    Eager learners that process all data upfront and discard it after analysis.

    machine_learning decision_trees
  • What does the Nearest Neighbour Classifier do?

    Classifies a test instance to the class label of the nearest training instance.

    machine_learning nearest_neighbour_classifier
  • What does k-NN stand for?

    k-nearest neighbours

    machine_learning k-nn
  • What type of model is k-NN?

    Non-parametric model

    machine_learning model_types
  • What is a major problem with k-NN?

    Sensitive to noise

    machine_learning problems
  • What is the solution to overfitting in k-NN?

    Use k > 1 nearest neighbours instead of only the single nearest one.

    machine_learning solutions
  • What does increasing k do to the classifier?

    Makes the decision boundary smoother and less sensitive to training data

    machine_learning k-nn
  • How should k be chosen in k-NN?

    Using a validation dataset

    machine_learning k-nn validation
  • What are some distance metrics used in k-NN?

    Mahalanobis distance, Hamming distance

    machine_learning distance_metrics
  • What does distance-weighted k-NN do?

    Assigns weights to neighbours based on their distance

    machine_learning k-nn weights
  • What happens if k=N in weighted k-NN?

    It becomes a global method

    machine_learning k-nn global_method
  • What is a disadvantage of k-NN for large datasets?

    It can be slow

    machine_learning k-nn performance
  • What is the curse of dimensionality in k-NN?

    Distance metrics may not work well in high dimensional spaces

    machine_learning curse_of_dimensionality
  • How does k-NN perform regression?

    Computes the mean value across k nearest neighbours

    machine_learning k-nn regression
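The k-NN cards above (lazy learning, majority vote for classification, mean of neighbours for regression) can be sketched as a small Python illustration; the training data are invented toy points:

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (point, target) training pairs closest to the query (Euclidean)."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]

def knn_classify(train, query, k):
    """Majority vote among the k nearest neighbours."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Mean target value across the k nearest neighbours."""
    values = [y for _, y in k_nearest(train, query, k)]
    return sum(values) / len(values)

train = [((0, 0), 'red'), ((1, 0), 'red'), ((5, 5), 'green'), ((6, 5), 'green')]
print(knn_classify(train, (0.5, 0.0), k=3))  # two of the 3 nearest are 'red'
```

Nothing is computed until a query arrives, which is what makes the method lazy; all the work happens at test time.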
  • What is the principle of decision trees?

    Focus on a specific subset or feature to make decisions

    machine_learning decision_trees
  • What type of learners are decision trees?

    Eager learners

    machine_learning decision_trees
  • What is decision tree learning?

    A method for approximating discrete classification functions using a tree-based representation.

    machine_learning decision_tree
  • How can decision trees be represented?

    As a set of if-then rules.

    machine_learning decision_tree
  • What type of search do decision tree learning algorithms use?

    Top-down greedy search through the space of possible solutions.

    machine_learning algorithms
  • Name some algorithms for constructing decision trees.

    ID3, C4.5, CART.

    machine_learning algorithms
  • What is the first step in the general decision tree algorithm?

    Search for the optimal splitting rule on training data.

    machine_learning decision_tree
  • What is the goal of finding an optimal split rule?

    To create partitioned datasets that are more 'pure' than the original dataset.

    machine_learning decision_tree
  • What does Information Gain measure?

    The reduction of information entropy.

    machine_learning id3
  • What does Gini Impurity measure?

    The probability of incorrectly classifying a randomly picked point according to class label distribution.

    machine_learning cart
  • What is Variance Reduction mainly used for?

    Regression trees where the target variable is continuous.

    machine_learning cart
  • Who introduced the concept of entropy in information theory?

    Claude Shannon (1916-2001).

    information_theory entropy
  • What does entropy measure?

    The uncertainty of a random variable.

    information_theory entropy
  • What is the formula for the amount of information required to determine the state of a random variable?

    I(x) = log2(K).

    information_theory entropy
  • How is the amount of information related to probability?

    I(x) = -log2(P(x)).

    information_theory probability
  • What happens to information required when the impostor is more likely in one box?

    Low entropy; less new information is gained.

    information_theory entropy
  • What is the information required when the impostor is equally likely in 4 boxes?

    I(x) = -log2(1/4) = 2 bits.

    information_theory entropy
  • What does low entropy indicate?

    You don’t need to know a lot of information to predict the value of a random variable.

    information_theory entropy
  • What does high entropy indicate?

    A lot of new information is gained when predicting the value of a random variable.

    information_theory entropy
  • What is the entropy of box 1?

    0.0439 bits (LOW entropy)

    entropy information
  • What is the entropy of box 2?

    6.6439 bits (HIGH entropy)

    entropy information
  • How is entropy defined?

    Average amount of information: \( H(X) = -\sum_k P(x_k) \log_2 P(x_k) \)

    entropy definition
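The entropy formula on this card translates directly into a few lines of Python; the sanity checks reuse earlier cards (a fair 50:50 split gives 1 bit, an impostor equally likely in 4 boxes gives 2 bits):

```python
import math

def entropy(probs):
    """H(X) = -sum P(x_k) * log2 P(x_k); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1.0]))        # certain outcome: 0.0 bits
print(entropy([0.25] * 4))   # 4 equally likely boxes: 2.0 bits
```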
  • What is the continuous entropy formula?

    \( H(X) = -\int f(x) \log_2 f(x) \, dx \)

    entropy continuous
  • What does a 50:50 split of information represent?

    Maximum entropy of 1 bit, the most random outcome for two classes.

    entropy randomness
  • What is information gain?

    Difference between initial entropy and weighted average entropy of subsets.

    information gain
  • What is the formula for information gain?

    \( IG(dataset, subsets) = H(dataset) - \sum_{S \in subsets} \frac{|S|}{|dataset|} H(S) \)

    information gain
  • What is the binary tree information gain formula?

    \( IG(dataset, subsets) = H(dataset) - \left( \frac{|S_{left}|}{|dataset|} H(S_{left}) + \frac{|S_{right}|}{|dataset|} H(S_{right}) \right) \)

    information gain
  • What are ordered values in decision trees?

    Attribute and split point (e.g., weight < 60)

    decision_trees ordered_values
  • What are categorical values in decision trees?

    Search for the most informative feature, create branches for each value.

    decision_trees categorical_values
  • What is the first step in using ID3 algorithm?

    Find the entropy of the initial dataset.

    id3 algorithm
  • What is the entropy of the dataset D with 9 positive and 5 negative outcomes?

    \( H(D) = 0.940 \)

    entropy dataset
  • What is the entropy for 'sunny' outcomes?

    \( H(D_{sunny}) = 0.971 \)

    entropy sunny
  • What is the entropy for 'overcast' outcomes?

    \( H(D_{overcast}) = 0 \)

    entropy overcast
  • What is the entropy for 'rain' outcomes?

    \( H(D_{rain}) = 0.971 \)

    entropy rain
  • What is the formula for information gain for 'outlook'?

    IG(D, outlook) = H(D) - (5/14 H(Dsunny) + 4/14 H(Dovercast) + 5/14 H(Drain))

    information_gain outlook
  • What is the total number of days in the dataset?

    14 days

    data total_days
  • What is the information gain for 'outlook'?

    0.246

    information_gain outlook
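The ID3 'outlook' cards above can be reproduced in a few lines. The per-subset counts (sunny 2+/3-, overcast 4+/0-, rain 3+/2-) are the standard 'play tennis' split consistent with the entropies quoted above, but they are an assumption, not stated on the cards themselves:

```python
import math

def entropy(pos, neg):
    """Entropy of a binary-labelled set given its positive/negative counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

h_d = entropy(9, 5)                                # ≈ 0.940, as on the card
subsets = [(2, 3), (4, 0), (3, 2)]                 # sunny, overcast, rain
ig = h_d - sum((p + n) / 14 * entropy(p, n) for p, n in subsets)
print(round(ig, 3))  # close to the 0.246 quoted above (0.246 when using rounded entropies)
```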
  • What happens to the 'overcast' subset?

    It is labeled as a tick since all outcomes are positive (1).

    decision_tree overcast
  • What is a common issue with decision trees?

    They can overfit the data.

    decision_trees overfitting
  • What is one method to deal with overfitting in decision trees?

    Stopping early or pruning the tree.

    overfitting pruning
  • What is the validation set size in cross-validation?

    20% of the provided data.

    cross_validation validation_set
  • What is the first step in pruning a decision tree?

    Go through each internal node connected only to leaf nodes.

    pruning decision_tree
  • What does a random forest consist of?

    A collection of decision trees trained on different subsets of data.

    random_forests decision_trees
  • What is the outcome of the algorithm in a random forest?

    The majority vote by all the different trees.

    random_forests algorithm_outcome
  • What do regression trees predict?

    A real-valued number instead of a class label.

    regression_trees prediction
  • What is used instead of information gain for regression trees?

    Variance reduction.

    regression_trees splitting_metric
  • How do you make predictions with regression trees?

    By taking an average or weighted average of samples in the leaves.

    regression_trees predictions
  • What is the purpose of taking an average in machine learning predictions?

    To make predictions based on the distance of different samples in the leaves of the tree.

    machine_learning prediction
  • What is the ultimate goal when creating machine learning systems?

    To develop models that generalise to previously unseen examples.

    machine_learning goals
  • What is a held-out test dataset used for?

    To measure the performance of a model on unknown data.

    machine_learning evaluation
  • Why is shuffling important before splitting a dataset?

    To avoid implicit ordering in the dataset that can bias results.

    machine_learning data_management
  • What are hyperparameters in machine learning?

    Model parameters chosen before training, such as 'k' in k-NN.

    machine_learning hyperparameters
  • What is the motivation behind hyperparameter tuning?

    To choose hyperparameter values that give the best performance.

    machine_learning hyperparameter_tuning
  • What is a disadvantage of testing hyperparameters on the training dataset?

    It usually does not generalise well to unseen examples.

    machine_learning evaluation
  • What should never be done when evaluating hyperparameters?

    Using the test dataset to select hyperparameters based on accuracy.

    machine_learning evaluation
  • What is the correct approach for dataset splitting in machine learning?

    Split into training, validation, and test sets, e.g., 60:20:20.

    machine_learning data_management
  • What is the purpose of the validation set?

    To select the best hyperparameters based on accuracy.

    machine_learning hyperparameter_tuning
  • What is hyperparameter tuning/optimisation?

    Selecting parameters that produce the best classifier performance.

    machine_learning hyperparameter_tuning
  • What can be done for final evaluation after hyperparameter tuning?

    Optionally include the validation set back into the training set.

    machine_learning evaluation
  • What can be included in the training set for final evaluation?

    Validation set can be included to retrain the model on the whole dataset after finding best hyperparameters.

    evaluation model hyperparameters
  • What is the purpose of including the validation set in training?

    It provides more data for training, potentially increasing model performance.

    training validation performance
  • When is the final evaluation done?

    The final evaluation is done on the test dataset.

    evaluation test dataset
  • What is a risk of developing and evaluating a model on the same data?

    It results in overfitting the model to the training data.

    overfitting model evaluation
  • What should the test set be used for?

    The test set should only be used for estimating performance on unknown examples.

    test performance evaluation
  • What is cross-validation used for?

    Cross-validation is used when the dataset is small to ensure effective testing.

    cross-validation testing dataset
  • What are the steps in cross-validation?

    1. Divide dataset into k folds. 2. Use k-1 for training, 1 for testing. 3. Iterate k times.

    cross-validation steps method
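The three steps on this card reduce to index bookkeeping; a minimal sketch (round-robin assignment, assuming the data were already shuffled as an earlier card recommends — real pipelines typically use a library helper such as scikit-learn's `KFold`):

```python
def k_fold_indices(n, k):
    """Assign indices 0..n-1 to k folds; each fold is held out for testing once."""
    folds = [list(range(start, n, k)) for start in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        splits.append((train, test))
    return splits

for train, test in k_fold_indices(6, 3):
    print("train:", train, "test:", test)
```

Every example appears in exactly one test fold, so averaging the k test scores uses the whole dataset for evaluation.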
  • What does the global error estimate formula represent?

    It averages performance metrics across all k held-out test sets.

    error estimation metrics
  • What is important about cross-validation in model evaluation?

    It evaluates an algorithm rather than a single trained instance of a model.

    evaluation algorithm model
  • What is one option for parameter tuning during cross-validation?

    Use 1 fold for testing, 1 for validation, and k-2 for training in each iteration.

    parameter_tuning cross-validation training
  • What is an alternative method for parameter tuning in cross-validation?

    Cross-validation within cross-validation, separating 1 fold for testing.

    parameter_tuning cross-validation method
  • How does the second option for parameter tuning help?

    It allows for optimal hyperparameters to be found using more data.

    parameter_tuning hyperparameters data
  • What is the advantage of using different hyperparameters on each fold during cross-validation?

    It likely leads to the best results for small data sets.

    cross-validation hyperparameters advantages
  • What is a disadvantage of using different hyperparameters on each fold?

    It requires more work and experiments than simpler methods and is not practical in all situations due to high computation needs.

    cross-validation hyperparameters disadvantages
  • What is the advantage of testing on all data when going into production?

    You can use all available data to train the model for better performance.

    production testing advantages
  • What is a disadvantage of testing on all data?

    You cannot estimate the performance of the final trained model anymore; you rely on hyperparameters generalizing.

    production testing disadvantages
  • What are the steps in CASE 1 for plenty of data available?

    1. Train on training set. 2. Tune on validation set. 3. Estimate performance using the test set.

    parameter_optimisation performance_estimation steps
  • What are the steps in CASE 2 for limited data available?

    1. Separate dataset into k folds. 2. Use 1 fold for testing, k-1 for training/validation. 3. Repeat k times. 4. Average results for performance estimation.

    parameter_optimisation performance_estimation steps
  • What does a confusion matrix represent?

    It visualizes performance, showing true labels vs. predicted labels, allowing analysis of model performance.

    evaluation_metrics confusion_matrix performance
  • What is accuracy in model evaluation?

    Accuracy = (TP + TN) / (TP + TN + FP + FN).

    evaluation_metrics accuracy formulas
  • How is classification error calculated?

    Classification error = 1 - accuracy.

    evaluation_metrics classification_error formulas
  • What is precision in model evaluation?

    Precision = TP / (TP + FP). It measures the correctness of positive predictions.

    evaluation_metrics precision formulas
  • What does high precision indicate?

    If a model predicts something as positive, it is likely to be correct.

    evaluation_metrics precision interpretation
  • What is recall in model evaluation?

    Recall = TP / (TP + FN). It measures the ability to find all positive examples.

    evaluation_metrics recall formulas
  • What is the precision for Class 1?

    60%

    precision class1
  • What is the formula for recall?

    Recall = \( \frac{TP}{TP + FN} \)

    recall formula
  • What is the recall for Class 1?

    75%

    recall class1
  • What does high recall indicate?

    Good at retrieving positive examples, but may include false positives.

    recall performance
  • What is the trade-off between precision and recall?

    High precision often leads to low recall and vice versa.

    precision recall tradeoff
  • What is macro-averaged recall for two classes?

    62.5%

    macro-averaging recall
  • What does the F-measure combine?

    It combines precision and recall into a single score.

    f-measure metrics
  • What is the formula for F1 score?

    \( F1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \)

    f1 formula
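The precision, recall, and F1 formulas on these cards, written out in Python; the counts (3 TP, 2 FP, 1 FN) are chosen so that they reproduce the Class 1 numbers quoted above (60% precision, 75% recall):

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

p, r = precision(3, 2), recall(3, 1)
print(p, r, f1(p, r))   # 0.6, 0.75, and their harmonic mean ≈ 0.667
```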
  • What does a confusion matrix evaluate?

    It evaluates performance in multi-class classification.

    confusion_matrix evaluation
  • What is accuracy in classification?

    Accuracy = \( \frac{Number \ of \ correctly \ classified \ examples}{Total \ number \ of \ examples} \)

    accuracy classification
  • What is the difference between micro-averaging and macro-averaging?

    Macro-averaging averages metrics at the class level; micro-averaging at the item level.

    averaging metrics
  • What is the effect of micro-averaging on precision, recall, and F1 in binary and multi-class classification?

    They equal accuracy.

    micro-averaging accuracy
  • What is micro-averaged precision, recall, and F1 equal to?

    Accuracy

    metrics classification
  • What is the most common evaluation metric for regression tasks?

    Mean Squared Error (MSE)

    regression metrics
  • How is MSE calculated?

    MSE = \( \frac{1}{N} \sum_{i=1}^{N} (Y_i - \tilde{Y}_i)^2 \)

    regression mse
  • What does a lower MSE indicate?

    Better predictions

    regression mse
  • What does RMSE stand for?

    Root Mean Squared Error

    regression rmse
  • How is RMSE calculated?

    RMSE = \( \sqrt{MSE} \)

    regression rmse
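The MSE and RMSE cards above as a short sketch; the true/predicted values are made up for illustration:

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: average squared distance between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: MSE's square root, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

print(mse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]))   # (1 + 0 + 4) / 3
```

RMSE is often preferred for reporting because it is interpretable in the target's own units (e.g. pounds, not squared pounds).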
  • What are the five important model characteristics in ML?

    Accurate, Fast, Scalable, Simple, Interpretable

    ml model characteristics
  • What is a balanced dataset?

    Equal number of examples in each class

    data dataset
  • What is an imbalanced dataset?

    Classes are not equally represented

    data dataset
  • What can affect accuracy in imbalanced datasets?

    Performance of the majority class

    metrics accuracy
  • What does macro-averaged recall help detect?

    If one class is completely misclassified

    metrics recall
  • What is a solution for imbalanced test sets?

    Normalize counts in the confusion matrix

    data solution
  • What does a normalized confusion matrix achieve?

    Calculates metrics as if evaluated on a balanced dataset

    data normalization
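Normalizing the confusion matrix, as the card describes, just means dividing each row (true class) by its total so every class contributes equally; the example matrix is invented, with class A ten times more frequent than class B:

```python
def normalize_confusion_matrix(matrix):
    """Divide each row (true class) by its row total, so each class counts equally."""
    return [[cell / sum(row) for cell in row] for row in matrix]

# imbalanced test set: 90 examples of class A, 10 of class B (rows = true labels)
cm = [[80, 10],
      [2, 8]]
print(normalize_confusion_matrix(cm))
```

After normalization each row sums to 1, so metrics computed from it behave as if the test set were balanced.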
  • What is one view of system performance on a balanced test set?

    The classifier's performance remains the same.

    performance classifier
  • What should be evaluated for a more realistic scenario?

    The system should be evaluated with data having a realistic distribution.

    evaluation realistic
  • What is one solution to balance classes?

    Down-sample the majority class.

    solutions balancing
  • What is another solution to balance classes?

    Up-sample the minority class.

    solutions balancing
  • What does overfitting indicate about model performance?

    Good performance on training data, but poor generalization to other data.

    overfitting performance
  • What does underfitting indicate about model performance?

    Poor performance on both training and test data.

    underfitting performance
  • What happens to classification error as models learn?

    Classification error decreases for training but may increase for test data.

    classification error
  • What can cause overfitting?

    A model that is too complex or training data that is not representative.

    overfitting causes
  • How can we fight overfitting?

    Choose optimal hyperparameters and use regularization.

    overfitting solutions
  • What is a confidence interval?

    A way to quantify confidence in an evaluation result.

    confidence evaluation
  • What affects confidence in an evaluation result?

    The size of the test set.

    confidence testset
  • What is the impact of a small test set on accuracy?

    90% accuracy on only 10 samples is not reliably better than, say, 84% measured on a much larger test set.

    accuracy testset
  • What affects confidence in evaluation results?

    The size of the test set affects confidence in evaluation results.

    confidence evaluation
  • What is true error?

    True error is the probability that the model misclassifies a randomly drawn example from a distribution.

    error model
  • How is true error mathematically defined?

    True error is defined as: \( error_D(h) \equiv \Pr_{x \in D}[f(x) \neq h(x)] \).

    error mathematics
  • What is sample error?

    Sample error is the classification error based on a sample from the underlying distribution.

    error sample
  • How is sample error mathematically defined?

    Sample error is defined as: \( error_S(h) \equiv \frac{1}{N} \sum_{x \in S} \delta(f(x), h(x)) \).

    error mathematics
  • What does \( \delta(f(x), h(x)) \) represent?

    \( \delta(f(x), h(x)) = 1 \) if f(x) ≠ h(x), and 0 if f(x) = h(x).

    error classification
  • What is a confidence interval?

    An N% confidence interval is an interval that is expected with probability N% to contain the parameter q.

    confidence interval
  • What does a 95% confidence interval [0.2, 0.4] mean?

    It means that with probability 95%, the true parameter q lies between 0.2 and 0.4.

    confidence interval
  • How does sample size affect confidence intervals?

    As sample size n increases, the confidence interval becomes narrower; its width shrinks toward 0.

    confidence sample_size
  • What is the example confidence interval for errorS(h) = 0.22 with n = 50?

    With n = 50 and z_N = 1.96, the interval is 0.22 ± 1.96·√(0.22·0.78/50) ≈ 0.22 ± 0.11, i.e. roughly [0.11, 0.33]; quite wide.

    confidence example
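The interval from this example can be computed directly; a quick sketch in plain Python using the standard normal-approximation formula:

```python
import math

def confidence_interval(error_s, n, z=1.96):
    """Normal-approximation confidence interval for the true error."""
    half_width = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

lo, hi = confidence_interval(0.22, 50)
print(round(lo, 2), round(hi, 2))  # roughly 0.11 and 0.33
```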
  • What does statistical significance testing help determine?

    Statistical significance testing helps determine if there is a difference between two distributions of classification errors.

    significance testing
  • What does a graph with overlapping distributions indicate?

    Overlapping distributions indicate uncertainty about which classifier is better due to sampling error.

    graphs distributions
  • What is the Marek ApprovedTM test?

    The Marek ApprovedTM test is the Randomisation test, considered intuitive for comparing algorithms.

    testing algorithms
  • What do statistical tests determine?

    Statistical tests tell us if the means of two sets are significantly different.

    statistics tests
  • Name three statistical tests mentioned.

    Randomisation, T-test, Wilcoxon rank-sum.

    statistics tests
  • How does the Randomisation test work?

    It repeatedly and randomly swaps predictions between the two systems and counts how often the resulting performance difference is at least as large as the observed one.

    statistics randomisation
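The procedure can be sketched as an approximate randomisation test over per-example correctness (a simplified illustration, not a full implementation; names and data are made up):

```python
import random

def randomisation_test(scores_a, scores_b, trials=1000, seed=0):
    """Approximate randomisation test: randomly swap paired results and
    count how often the shuffled difference is at least as large as the
    observed one. Returns an estimated p-value."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    count = 0
    for _ in range(trials):
        a, b = 0, 0
        for x, y in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                x, y = y, x          # swap this prediction between systems
            a += x
            b += y
        if abs(a - b) >= observed:
            count += 1
    return count / trials

# per-example correctness (1 = correct) for two systems
p = randomisation_test([1, 1, 1, 1, 0, 1], [0, 1, 0, 1, 0, 0])
```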
  • What does a small p-value indicate?

    A small p-value means we can be more confident that one system is different from the other.

    statistics p-value
  • What is the null hypothesis?

    The null hypothesis states that the two algorithms/models perform the same and differences are due to sampling error.

    statistics hypothesis
  • What is the significance level for performance difference?

    Performance difference is statistically significant if p < 0.05 (5%).

    statistics significance
  • What is P-hacking?

    P-hacking is the misuse of data analysis to find patterns that appear statistically significant without an underlying effect.

    statistics p-hacking
  • What happens if the number of experiments increases in P-hacking?

    Increasing experiments can lead to a higher false discovery proportion, even if true discoveries remain the same.

    statistics p-hacking
  • What is the false positive rate in the example of P-hacking?

    P(false positive) = 0.05, the same as the significance level.

    statistics false_positive
  • What is the false discovery proportion in the initial example?

    The false discovery proportion is 35 / 115 = 30%.

    statistics false_discovery
  • What happens to the false discovery proportion when experiments increase to 2400?

    The false discovery proportion increases to 115 / 195 = 59%.

    statistics false_discovery
  • How many true discoveries were made?

    80 true discoveries

    statistics research
  • How many false discoveries were made?

    115 false discoveries

    statistics research
  • What is the false discovery proportion?

    59%

    statistics research
  • What is the sample size of the 'study'?

    54 people

    statistics sample_size
  • How many possible relations were searched in the 'study'?

    27,716 possible relations

    statistics research
  • What is a method to defend against unintentional p-hacking?

    Adaptive threshold for calculating p-value (Benjamini & Hochberg, 1995)

    statistics p-hacking
  • What is the first step in the Benjamini-Hochberg method?

    Rank the p-values from the M experiments

    statistics p-hacking
  • What does the Benjamini-Hochberg critical value formula represent?

    New significance threshold (critical value)

    statistics p-hacking
  • What is the original significance threshold in the Benjamini-Hochberg method?

    5%

    statistics p-hacking
  • What is the downside of the Benjamini-Hochberg method?

    Thresholds for most experiments will be lower than the original 5%

    statistics p-hacking
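The Benjamini-Hochberg steps above can be sketched as follows (the p-values are illustrative):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the number of discoveries: rank the p-values, compare each to
    its adaptive critical value (rank/M)*alpha, and reject all hypotheses
    up to the largest rank whose p-value falls below its threshold."""
    m = len(p_values)
    ranked = sorted(p_values)
    last_ok = 0
    for k, p in enumerate(ranked, start=1):
        if p <= (k / m) * alpha:   # adaptive threshold for rank k
            last_ok = k
    return last_ok

print(benjamini_hochberg([0.01, 0.02, 0.03, 0.04, 0.20]))  # 4 discoveries
```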
  • What are Artificial Neural Networks (ANNs)?

    A class of ML algorithms optimized with gradient descent

    machine_learning neural_networks
  • What does Deep Learning refer to?

    Using neural network models with multiple hidden layers

    machine_learning neural_networks
  • Why has deep learning become more popular now?

    Better conditions for implementation, like big data and faster hardware

    machine_learning neural_networks
  • What are perceptrons?

    An early version of neural networks proposed in 1958 by Rosenblatt

    machine_learning neural_networks
  • What is backpropagation?

    Described in 1974 by Werbos, it is a training algorithm for neural networks

    machine_learning neural_networks
  • What are LSTMs and CNNs?

    Key components of modern neural network architectures described in the late '90s

    machine_learning neural_networks
  • What is a benefit of having large datasets for neural networks?

    They improve training efficiency and effectiveness

    machine_learning data
  • What advancements have improved neural network training?

    Better CPUs and GPUs for efficient computation

    machine_learning hardware
  • What operations can be efficiently parallelized on graphics cards?

    Matrix operations

    graphics computing
  • What has improved the accessibility and affordability of graphics cards?

    Increased efficiency and reduced cost

    graphics cost
  • What are automatic differentiation libraries used for?

    They handle back propagation and optimization of model parameters

    software differentiation
  • What is linear regression useful for in machine learning?

    It serves as a stepping stone towards neural network models

    machine_learning linear_regression
  • What type of learning is linear regression?

    Supervised learning

    learning supervised
  • What does the dataset in supervised learning consist of?

    Input and output pairs

    dataset supervised
  • What is the goal of supervised learning?

    Learn the mapping f: X → Y

    goal learning
  • What does the function f represent in linear regression?

    The mapping from inputs to outputs

    function linear_regression
  • What are the desired labels in classification problems?

    Discrete labels

    classification labels
  • What are the desired labels in regression problems?

    Continuous labels

    regression labels
  • What controls the gradient of a straight line in linear regression?

    The parameter 'a'

    linear_regression gradient
  • What does the parameter 'b' represent in linear regression?

    The y-intercept

    linear_regression intercept
  • What does the loss function measure in linear regression?

    How well we are performing on our dataset

    loss performance
  • What is the formula for the loss function in linear regression?

    E = (1/2) * Σ(ŷ(i) − y(i))^2

    loss formula
  • What does a smaller value of E indicate?

    Predictions are close to real values

    e_value predictions
  • What do derivatives show in the context of linear regression?

    How to change each parameter value to reduce loss

    derivatives parameters
  • What is the purpose of gradient descent?

    To repeatedly update parameters a and b

    gradient_descent optimization
  • What does the learning rate (α) control in gradient descent?

    The step size for updating parameters

    learning_rate gradient_descent
  • What is the learning rate in gradient descent?

    The learning rate, denoted as 𝛼, is a hyperparameter that determines the size of the steps taken towards the minimum of the loss function.

    machine_learning hyperparameter
  • What does 𝜕𝐸/𝜕𝑎 represent?

    It represents the partial derivative of the loss function with respect to parameter 𝑎.

    calculus derivative
  • What is the formula for updating parameter 𝑎?

    The update rule is: a_new := a_old − (α/N) Σ_i (a·x(i) + b − y(i)) · x(i), where N is the total number of data points.

    machine_learning gradient_descent
  • What does an epoch represent in machine learning?

    An epoch is one complete pass over the entire dataset during training.

    machine_learning training
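The update rules for a and b can be sketched in plain Python (toy data and hyperparameters are illustrative):

```python
def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Gradient descent for y = a*x + b with squared-error loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):                      # one epoch = one full pass
        grad_a = sum((a * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((a * x + b - y) for x, y in zip(xs, ys)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])      # data generated by y = 2x + 1
print(round(a, 2), round(b, 2))                  # → 2.0 1.0
```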
  • What is the gradient in vector notation?

    The gradient is the vector of all partial derivatives of a function with K parameters: ∇_θ f(θ) = [∂f(θ)/∂θ_1, ∂f(θ)/∂θ_2, ..., ∂f(θ)/∂θ_K].

    calculus gradient
  • What is the analytical solution for linear regression?

    The analytical solution allows finding optimal parameters without iterating through epochs by solving a specific equation.

    machine_learning linear_regression
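The analytical (normal-equation) solution can be sketched with NumPy; solving (XᵀX)θ = Xᵀy avoids computing an explicit matrix inverse (data reused from the toy example, values are illustrative):

```python
import numpy as np

# toy data generated by y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

X = np.column_stack([x, np.ones_like(x)])        # add a bias column
theta = np.linalg.solve(X.T @ X, X.T @ y)        # [a, b] in one step
print(theta)  # ≈ [2, 1]
```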
  • What is the complexity of matrix inversion?

    Matrix inversion has cubic complexity, making it computationally expensive for large problems.

    computational_complexity matrix
  • What is multiple linear regression?

    Multiple linear regression uses multiple input features, each with its own parameter, to predict an output value.

    machine_learning linear_regression
  • How does the RMSE change with multiple features?

    The RMSE (Root Mean Square Error) is typically lower with multiple features due to increased information for prediction.

    machine_learning evaluation
  • What is RMSE in model evaluation?

    Root Mean Square Error (RMSE) measures the differences between predicted and observed values; lower RMSE indicates better model accuracy.

    modeling evaluation
  • How does using more features affect model predictions?

    Using more features provides more information, leading to more accurate predictions in the model.

    features accuracy
  • What does a linear regression model represent in higher dimensions?

    In higher dimensions, the linear regression model is a continuous linear plane representing the learned data.

    linear_regression dimensions
  • What is the role of the nucleus in a biological neuron?

    The nucleus acts like the neuron's brain, telling it what to do.

    biology neurons
  • What do dendrites do in a biological neuron?

    Dendrites connect to other neurons and receive signals from them.

    biology neurons
  • What happens when a biological neuron's axon fires?

    When conditions are right, the axon fires a signal to connect with other neurons' dendrites.

    biology neurons
  • What are input features in an artificial neuron?

    Input features (xi) are the values fed into the artificial neuron, each with an associated weight (θi).

    artificial_neuron features
  • What determines the importance of a feature in an artificial neuron?

    The weight (θi) associated with each input feature determines its importance in the artificial neuron.

    artificial_neuron weights
  • What does the output of an artificial neuron involve?

    The output involves multiplying features and weights, and adding the bias (b).

    artificial_neuron output
  • What is the activation function in an artificial neuron?

    The activation function (g) transforms the output of the linear equation into a new value.

    artificial_neuron activation_function
  • How can the bias term be included in the equation?

    The bias term can be included by reformulating the equation to add an extra feature and weight for the bias.

    artificial_neuron bias
  • What is the vector notation for input features and weights?

    Input features and weights can be represented as vectors: x = [x1, x2, ..., xK], W = [θ1, θ2, ..., θK].

    vector_notation features
  • What is the logistic activation function used for?

    The logistic function (sigmoid) squashes any value into a range between 0 and 1.

    activation_function logistic_function
  • What does logistic regression actually do?

    Logistic regression performs binary classification using the logistic function, not actual regression.

    logistic_regression classification
  • How is the logistic regression model optimized?

    The logistic regression model is optimized using gradient descent.

    logistic_regression optimization
  • What is a perceptron?

    A perceptron is an algorithm for supervised binary classification, an early version of an artificial neuron.

    perceptron classification
  • What activation function does a perceptron use?

    A perceptron uses a threshold function as its activation function, outputting 0 until a certain limit is reached.

    perceptron activation_function
  • How does the perceptron's threshold activation function behave?

    A threshold function that outputs 0 until a limit (θ) is reached, then outputs 1.

    gradient_descent activation_function
  • What is the output of the activation function in the perceptron?

    1 if Wᵀx > 0, otherwise 0.

    perceptron activation_function
  • What is the perceptron learning rule update formula?

    θ_i ← θ_i + α(y − h(x)) · x_i

    perceptron learning_rule
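The learning rule can be sketched on the (linearly separable) OR function; a constant bias feature of 1 is prepended to each input, and the settings are illustrative:

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def train_perceptron(data, lr=0.1, epochs=20):
    """Perceptron rule: theta_i += lr * (y - h(x)) * x_i."""
    w = [0.0, 0.0, 0.0]                     # weights for [bias, x1, x2]
    for _ in range(epochs):
        for x, y in data:
            h = step(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + lr * (y - h) * xi for wi, xi in zip(w, x)]
    return w

# OR truth table with a constant bias input of 1
or_data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train_perceptron(or_data)
preds = [step(sum(wi * xi for wi, xi in zip(w, x))) for x, _ in or_data]
print(preds)  # matches OR: [0, 1, 1, 1]
```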
  • What happens when y = 1 and h(x) = 0?

    Weight θi is increased if xi is positive, decreased if negative.

    perceptron weight_update
  • What happens when y = 0 and h(x) = 1?

    Weight θ_i is decreased if x_i is positive and increased if negative, reducing Wᵀx.

    perceptron weight_update
  • What types of functions can a perceptron learn?

    Any linearly separable function, like logical OR.

    perceptron functions
  • Why can't a perceptron learn XOR?

    XOR is not linearly separable; one linear line cannot separate the classes.

    perceptron xor
  • What is a weakness of using a single neuron?

    It cannot classify complex relationships like XOR.

    perceptron weaknesses
  • What is needed to model complex relationships in data?

    Multi-layer neural networks are required.

    neural_networks complex_relationships
  • What is a multi-layer perceptron (MLP)?

    A network that connects neurons in sequence to learn higher order features.

    mlp neural_networks
  • What is the role of hidden layers in a neural network?

    They process features and are not visible from the outside.

    hidden_layers neural_networks
  • What does each block in a block diagram represent?

    A layer of the model with multiple neurons.

    block_diagram neural_networks
  • What is the first and last layer of a neural network called?

    The first layer is the input layer and the last is the output layer.

    neural_networks layers
  • What should you check when something isn’t working in a neural network?

    Ensure that the matrix dimensions match.

    troubleshooting neural_networks
  • What is b in the context of a neural network layer?

    The layer-specific bias vector, unique to each neuron in a layer.

    bias neural_networks
  • What should you check when working with matrices?

    Matrix dimensions must match.

    matrices dimensions
  • What is 'b_' in a neural network?

    Layer-specific bias vector for each neuron.

    neural_networks bias
  • How many neurons are typically in deep neural networks?

    Thousands or millions of neurons.

    neural_networks neurons
  • What can multi-layer neural networks learn?

    Useful representations and features.

    neural_networks features
  • What was the approach to feature crafting before multi-layer networks?

    Manually crafting features for pattern recognition.

    pattern_recognition features
  • What is end-to-end learning?

    Allowing the network to learn features from raw input.

    machine_learning end-to-end
  • What do lower levels of a neural network act as?

    Feature extractors.

    neural_networks feature_extraction
  • What do higher levels of a neural network learn?

    They act as the classification layer on top of the extracted features.

    neural_networks classification
  • What is the benefit of training both feature extraction and classification layers together?

    They optimize each other based on data.

    optimization training
  • What should you use if the data is linearly separable?

    A linear function for the model.

    activation_functions linear
  • What happens if we only use linear activation functions in a multi-layer network?

    It becomes equivalent to a single-layer network.

    activation_functions linear
  • What is the simplest activation function?

    Linear activation (identity function).

    activation_functions linear
  • What does the output of a neuron with linear activation become?

    ŷ = f(Wᵀx) = Wᵀx.

    activation_functions output
  • What is the equation for output in a two-layer network?

    ŷ = W1(W2x) → y = Ux, where U = W1W2.

    neural_networks equation
  • What happens when a two-layer network uses linear activation?

    It collapses into a single-layer network, unable to capture complex non-linear patterns.

    network linear
  • What do non-linear activation functions do?

    They allow models to learn complicated patterns by breaking the dependency of multiple layers collapsing into one.

    activation non-linear
  • What is the range of the sigmoid activation function?

    The sigmoid function compresses output into the range between 0 and 1.

    activation sigmoid
  • What is the formula for the sigmoid activation function?

    f(x) = σ(x) = 1 / (1 + e^(-x))

    activation sigmoid
  • What is the range of the tanh activation function?

    The tanh function maps input values to the range -1 to 1.

    activation tanh
  • What is the formula for the tanh activation function?

    f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

    activation tanh
  • What characterizes the ReLU activation function?

    ReLU is linear and unbounded in the positive part, but non-linear overall.

    activation relu
  • What is the formula for the ReLU activation function?

    f(x) = ReLU(x) = { 0 for x ≤ 0; x for x > 0 }

    activation relu
  • What does the softmax activation function do?

    It scales inputs into a probability distribution that sums to 1.

    activation softmax
  • What is the formula for the softmax activation function?

    softmax(zi) = e^(zi) / ∑ e^(zk)

    activation softmax
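The four activation functions above can be sketched in plain Python (softmax subtracts the max before exponentiating for numerical stability, a common implementation trick not shown in the formulas):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))            # squashes into (0, 1)

def tanh(x):
    return math.tanh(x)                      # squashes into (-1, 1)

def relu(x):
    return x if x > 0 else 0                 # 0 below zero, unbounded above

def softmax(zs):
    m = max(zs)                              # shift by max: same result, no overflow
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

print(sigmoid(0))                            # 0.5
print(round(sum(softmax([1.0, 2.0, 3.0])), 6))  # 1.0 (a probability distribution)
```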
  • What is a common activation function for deep neural networks?

    ReLU is commonly used in very deep neural networks, especially for image recognition.

    activation relu
  • Which activation functions are more robust than ReLU?

    Tanh and sigmoid are more robust than ReLU.

    activation robustness
  • What is a potential issue with using ReLU?

    ReLU outputs are unbounded in the positive direction, and very large activations can destabilize the network.

    activation relu
  • What should you try first when designing models?

    Experiment with tanh and sigmoid first, as they are bounded.

    activation modeling
  • How should the choice of activation function in hidden layers be treated?

    It is a hyperparameter that can be set empirically or optimized using a development set.

    activation hyperparameter
  • How can we set hyperparameters for activation functions?

    Empirically or using a development set to find the best performing function for the model and dataset.

    hyperparameters activation_functions
  • What determines the choice of activation function in the output layer?

    It depends on the task.

    activation_function output_layer
  • What activation function is commonly used for binary classification?

    Sigmoid is most common; tanh can also be used.

    binary_classification activation_function
  • What activation function should be used for predicting unbounded scores?

    Use a linear activation function.

    unbounded_scores activation_function
  • What activation function is most commonly used for predicting a probability distribution?

    Softmax is used for multi-class classification.

    probability_distribution softmax
  • What does Softmax do?

    It scales values into a probability distribution, making them sum to 1.

    softmax probability_distribution
  • What is the input dimension for the neural network in PyTorch?

    The input dimension is 10.

    pytorch neural_network input_dimension
  • How many neurons are in the hidden layer of the PyTorch neural network?

    There are 5 neurons in the hidden layer.

    pytorch neural_network hidden_layer
  • What is the output dimension of the PyTorch neural network?

    The output dimension is 1.

    pytorch neural_network output_dimension
  • What activation function is applied in the hidden layer during the forward pass?

    Tanh is used as the activation function.

    forward_pass activation_function tanh
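The 10 → 5 → 1 network described in these cards can be sketched; here in NumPy to make the forward pass explicit (in PyTorch it would be two `nn.Linear` layers with `torch.tanh` between them; the random initialization is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# layer shapes match the cards: input 10, hidden 5, output 1
W1, b1 = rng.normal(size=(10, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

def forward(x):
    """Forward pass: linear -> tanh -> linear."""
    hidden = np.tanh(x @ W1 + b1)    # tanh activation in the hidden layer
    return hidden @ W2 + b2          # linear output of dimension 1

batch = rng.normal(size=(4, 10))     # a batch of 4 input vectors
print(forward(batch).shape)          # (4, 1)
```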
  • What is the purpose of the loss function in neural networks?

    To minimize and show performance on a specific task.

    loss_function optimization
  • How do we update parameters in neural networks?

    Using gradient descent to minimize the loss function.

    gradient_descent parameter_update
  • What is the formula for updating parameters in gradient descent?

    \( \theta_i^{(t+1)} = \theta_i^{(t)} - \alpha \frac{\partial E}{\partial \theta_i^{(t)}} \)

    gradient_descent formula
  • What type of task is a regression task?

    Predicting a continuous variable, like velocity or price.

    regression continuous_variable
  • What is the goal of a regression task?

    To predict a continuous variable.

    regression prediction
  • What is an example of a regression task?

    Predicting the price of a house.

    regression example
  • What activation function is often used in the output layer for regression?

    Linear activation.

    activation regression
  • What loss function is commonly used in regression?

    Mean Squared Error (MSE).

    loss regression
  • What is the formula for Mean Squared Error (MSE)?

    MSE = \frac{1}{N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2

    loss mse
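The MSE formula can be sketched directly (values are illustrative):

```python
def mse(preds, targets):
    """Mean squared error: average of the squared differences."""
    n = len(preds)
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / n

print(mse([2.0, 4.0], [1.0, 4.0]))  # → 0.5
```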
  • What does MSE equal when predictions are correct?

    0.

    mse accuracy
  • What is the primary goal of classification tasks?

    To choose between different categories or discrete options.

    classification tasks
  • What is binary classification?

    Classification with only 2 possible classes.

    classification binary
  • What is multi-class classification?

    Classification with more than 2 classes, where each input belongs to exactly 1 class.

    classification multi-class
  • What is multi-label classification?

    Each input can belong to multiple classes.

    classification multi-label
  • What is the loss function used in classification?

    Cross-entropy.

    loss classification
  • What do we want to maximize in classification?

    The likelihood of the network assigning correct labels.

    classification likelihood
  • What is the probability formulation for binary classification?

    \prod_{i=1}^{N} (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{(1-y^{(i)})}

    binary classification
  • What happens if the network assigns the correct label for every data point?

    The product approaches 1.

    classification accuracy
  • What is the issue with multiplying probabilities in classification?

    It can lead to underflow errors.

    classification underflow
  • How can we avoid underflow errors in classification?

    By maximizing the logarithm of the probability formula.

    classification underflow
  • What is the formula for binary cross-entropy loss?

    -\sum_{i=1}^{N} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]

    loss binary_cross-entropy
  • What is the formula for binary cross-entropy loss?

    \( L = -\frac{1}{N} \sum_{i=1}^{N} [y(i) \log(\hat{y}(i)) + (1 - y(i)) \log(1 - \hat{y}(i))] \)

    loss binary cross-entropy
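The normalized binary cross-entropy can be sketched as follows (the small epsilon that guards against log(0) is an implementation detail, not part of the formula):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -(1/N) * sum[ y*log(y_hat) + (1-y)*log(1-y_hat) ]."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)        # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n

# confident, correct predictions give a loss near 0
print(binary_cross_entropy([1, 0], [0.99, 0.01]))
```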
  • What does normalizing by the number of data points do in loss calculation?

    It makes the loss magnitude independent of the number of data points.

    normalization loss
  • What is categorical cross-entropy?

    It generalizes binary cross-entropy for multiple classes.

    loss categorical cross-entropy
  • What is the formula for categorical cross-entropy loss?

    \( L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_c(i) \log(\hat{y}_c(i)) \)

    loss categorical cross-entropy
  • In categorical cross-entropy, what does y_c represent?

    1 if C is the correct class for data point i, 0 otherwise.

    classification categorical cross-entropy
  • What is the output layer configuration for a multi-class classification neural network example?

    An output layer with 3 neurons predicting probabilities over 3 flower types.

    neural_networks multi-class classification
  • What activation function is commonly used with categorical cross-entropy loss?

    Softmax activation.

    activation softmax loss
  • What is batching in neural networks?

    Combining vectors of several data points into a matrix for simultaneous processing.

    batching neural_networks
  • Why is batching beneficial for training neural networks?

    It increases speed and reduces noise, leveraging GPU efficiency.

    batching training efficiency
  • What does batching allow GPUs to do more efficiently?

    Perform matrix multiplications in parallel.

    gpus efficiency batching
  • How does batching assist in regularization during optimization?

    It combines updates from several data points easily.

    regularization optimization batching
  • What is the benefit of batching in neural networks?

    Combines updates from several datapoints, making updates more stable and accurate.

    neural_networks batching
  • What is the input matrix X in a neural network?

    A batch of data points with dimensions n × k, where n is the number of data points and k is the number of features.

    neural_networks input_matrix
  • What does the first layer in a neural network apply?

    A linear transformation using a weight matrix and adding a bias.

    neural_networks first_layer
  • What is Z in the context of a neural network?

    The output matrix after applying the weight matrix and bias in the first layer.

    neural_networks output_matrix
  • What do we get after applying the activation function to Z?

    A, the output of the first hidden layer.

    neural_networks activation_function
  • What is the purpose of calculating loss in a neural network?

    To determine how well the model performs.

    neural_networks loss
  • What method is used to update model parameters in neural networks?

    Gradient descent is used to update weight matrices and biases.

    neural_networks gradient_descent
  • What is backpropagation in neural networks?

    A method to calculate necessary partial derivatives iteratively.

    neural_networks backpropagation
  • How does backpropagation simplify calculations?

    It breaks down calculations into smaller steps, moving backwards through the network.

    neural_networks calculations
  • What is the chain rule used for in neural networks?

    To calculate the derivative of a composite function.

    neural_networks chain_rule
  • What does the chain rule formula represent?

    It shows how to break down derivatives into smaller parts for easier calculation.

    neural_networks chain_rule
  • How can we find the partial derivative of the loss with respect to W[1]?

    By breaking it down through Z[1] and A[1] using their respective derivatives.

    neural_networks partial_derivatives
  • What are the two types of partial derivatives in backpropagation?

    The output of an activation function w.r.t its input and the output of a linear transformation w.r.t its input.

    neural_networks partial_derivatives
  • What is the purpose of the partial derivative in backpropagation?

    To update the weights of the linear transformation in the neural network.

    backpropagation neuralnetworks
  • What does the partial derivative of a matrix w.r.t another matrix represent?

    A 4-D tensor containing the partial derivatives of every element in the first matrix w.r.t every element in the second.

    mathematics tensor
  • What is the linear transformation notation used in backpropagation?

    Z = XW, where Z is the output, X is input, and W is weights.

    backpropagation notation
  • What do you need to calculate to update weights in a linear transformation?

    The partial derivative of the loss w.r.t the weights and the bias vector.

    backpropagation weights
  • What is the shape of the partial derivative of a scalar w.r.t a matrix?

    It has the same shape as the original matrix itself.

    mathematics derivatives
  • What is the key component in the derivatives during backpropagation?

    The partial derivative of the loss w.r.t the output of the linear transformation.

    backpropagation loss
  • What does backpropagation iteratively calculate?

    Partial derivatives, taking them from the top layers and passing them down.

    backpropagation calculation
  • What is the bias vector used for in backpropagation?

    It is broadcast (repeated) across all data points in the batch, so the same bias is added to every input vector in the layer.

    backpropagation bias
  • What is necessary for lower levels to calculate their own partial derivatives?

    The gradient of the loss w.r.t the input and the weight's partial derivative.

    backpropagation gradient
  • What rule is used to break down the calculations in backpropagation?

    The chain rule.

    backpropagation chainrule
  • What is the significance of the dimensions N, D, and M in backpropagation?

    They represent the number of inputs, dimensions, and outputs respectively.

    backpropagation dimensions
  • What happens during the forward pass in a neural network?

    The operation takes X and W as inputs and produces output Z.

    neuralnetworks forwardpass
  • What does the partial derivative of the loss w.r.t one element depend on?

    It depends on the weights it multiplies with and the loss of whatever uses this element.

    backpropagation loss derivative
  • How many output values does the particular element affect?

    It affects exactly 3 output values: z1,1, z1,2, and z1,3.

    neural_network output_values
  • What is the equation for the partial derivative of the element?

    The equation uses the chain rule and involves the weight w1,1 and the partial derivative of z1,1 w.r.t x1,1.

    chain_rule equation
  • What happens when you calculate the partial derivative w.r.t the full matrix X?

    It can be expressed as a dot product of two matrices.

    matrix partial_derivative
  • What do the two matrices in the dot product represent?

    The first is the partial derivative of the loss w.r.t Z, and the second is the transposed weight matrix for the layer.

    dot_product matrices
  • What is the importance of backpropagation for inputs X?

    It is a simple way of calculating backpropagation for inputs in a given layer.

    backpropagation inputs
  • How do we calculate the partial derivative w.r.t the weights?

    By breaking it down for one individual weight, considering its effect on the output.

    weights partial_derivative
  • What does one weight affect in the output?

    One weight affects two values in the output for two data points in the batch.

    weights output
  • What is the equation for the partial derivative of the loss w.r.t the weights?

    It is the dot product of the transposed feature matrix Xᵀ and the partial derivative of the loss w.r.t Z: ∂L/∂W = Xᵀ(∂L/∂Z).

    weights dot_product
  • What do we need to calculate for the bias vector?

    The partial derivative of the loss w.r.t the bias vector.

    bias partial_derivative
  • What result do we get for the partial derivative of the loss w.r.t the bias?

    It is equal to a transposed column vector of 1s times the partial derivative of the loss w.r.t z.

    bias loss
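The three backprop results above (gradients for the inputs, the weights, and the bias) can be sketched with NumPy; shapes follow the Z = XW + b convention from these cards, and the random values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 4, 3, 2                 # batch size, input dim, output dim

X = rng.normal(size=(N, D))
W = rng.normal(size=(D, M))
b = rng.normal(size=(M,))

Z = X @ W + b                     # forward: linear transformation
dZ = rng.normal(size=(N, M))      # gradient of the loss w.r.t. Z (from above)

dX = dZ @ W.T                     # passed down to the previous layer
dW = X.T @ dZ                     # used to update the weights
db = np.ones((1, N)) @ dZ         # row of 1s: sums dZ over the batch

print(dX.shape, dW.shape, db.shape)  # (4, 3) (3, 2) (1, 2)
```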
  • What is needed to perform full backpropagation through the neural network?

    How to handle the activation functions.

    backpropagation activation_functions
  • How are activation functions generally applied?

    They are applied element-wise.

    activation_functions element-wise
  • What is the purpose of activation functions in a neural network?

    Activation functions are applied element-wise to introduce non-linearity, allowing the network to learn complex patterns.

    neural_networks activation_functions
  • Do activation functions have parameters that need updating during training?

    No, activation functions generally do not have parameters that need to be updated during training.

    neural_networks activation_functions
  • What is the derivative of an activation function denoted as?

    The derivative of an activation function is denoted as g′(x).

    neural_networks activation_functions
  • What does the chain rule help with in back propagation?

    The chain rule helps calculate the partial derivative of the loss with respect to the inputs of the activation function.

    neural_networks backpropagation
  • What is the derivative of the Linear activation function?

    For Linear: g(z) = z, g′(z) = 1.

    activation_functions linear
  • What is the formula for the Sigmoid activation function?

    For Sigmoid: g(z) = 1/(1 + e^(-z)), g′(z) = g(z)(1 - g(z)).

    activation_functions sigmoid
  • What is the formula for the Tanh activation function?

    For Tanh: g(z) = (e^z - e^(-z))/(e^z + e^(-z)), g′(z) = 1 - g(z)².

    activation_functions tanh
  • What is the ReLU activation function and its derivative?

    For ReLU: g(z) = z for z > 0, 0 for z ≤ 0; g′(z) = 1 for z > 0, 0 for z ≤ 0.

    activation_functions relu
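  • The formulas above translate directly into element-wise numpy code; a minimal sketch:

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)              # g'(z) = g(z)(1 - g(z))

    def tanh_grad(z):
        return 1.0 - np.tanh(z) ** 2      # g'(z) = 1 - g(z)^2

    def relu(z):
        return np.maximum(0.0, z)         # g(z) = z for z > 0, else 0

    def relu_grad(z):
        return np.where(np.asarray(z) > 0, 1.0, 0.0)  # 1 for z > 0, else 0
    ```
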
  • How is Softmax different from other activation functions?

    Softmax takes a whole vector as input and outputs a whole vector, unlike other activation functions applied element-wise.

    activation_functions softmax
  • What is the purpose of combining Softmax with cross-entropy?

    Combining Softmax with cross-entropy simplifies the backpropagation of derivatives for classification tasks.

    neural_networks softmax cross_entropy
  • What does the joint partial derivative through Softmax and cross-entropy represent?

    It represents the predictions minus the true class labels, normalized by N if applicable.

    neural_networks softmax cross_entropy
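  • A sketch of the combined gradient, assuming one-hot labels and normalisation by the batch size N:

    ```python
    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))  # subtract row max for stability
        return e / e.sum(axis=1, keepdims=True)

    def softmax_ce_grad(z, y_onehot):
        """Joint gradient of cross-entropy through Softmax w.r.t. the logits z:
        predictions minus the true class labels, normalised by N."""
        n = z.shape[0]
        return (softmax(z) - y_onehot) / n
    ```
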
  • What is gradient descent?

    Gradient descent is an optimization algorithm that updates parameters by taking small steps in the negative direction of the gradient.

    optimization gradient_descent
  • What is the formula for updating weights in gradient descent?

    W_new = W_old - α * (∂L/∂W), where α is the learning rate.

    optimization gradient_descent
  • What is the learning rate in gradient descent?

    The learning rate (α) is a hyperparameter that determines the step size for updating model parameters.

    optimization learning_rate
  • What is the formula for updating weights in gradient descent?

    W_new = W_old − α (∂L/∂W)

    gradient_descent formula
  • What does α represent in gradient descent?

    Learning rate/step size, a hyperparameter tuned on the development set.

    hyperparameter learning_rate
  • What must be true for gradients to be computed in neural networks?

    Network functions and the loss need to be differentiable.

    neural_networks gradients
  • What is the first step in the general algorithm for gradient descent?

    Initialise weights randomly.

    algorithm gradient_descent
  • What is the termination condition in gradient descent?

    When the loss function does not improve anymore.

    termination gradient_descent
  • What is a common issue when updating weights during backpropagation?

    Updating weights in place before all gradients have been computed with the original weights, which corrupts the remaining gradient calculations.

    backpropagation errors
  • What is Stochastic Gradient Descent (SGD)?

    Calculating the gradient based on one data point and updating weights immediately.

    sgd gradient_descent
  • What are the steps in Stochastic Gradient Descent?

    1. Initialise weights randomly.
    2. Loop until convergence: a) Loop over each datapoint: i) Compute gradient based on the datapoint. ii) Update weights.
    sgd algorithm
  • What is Mini-batched Gradient Descent?

    A balance between batch and stochastic gradient descent, using batches of data points.

    mini-batch gradient_descent
  • What are the steps in Mini-batched Gradient Descent?

    1. Initialise weights randomly.
    2. Loop until convergence: a) Loop over batches of data points: i) Compute gradient based on the batch. ii) Update weights.
    mini-batch algorithm
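  • The steps above can be sketched on a toy linear-regression problem (the data, loss, and learning rate here are illustrative assumptions):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w                       # noiseless targets for illustration

    w = rng.normal(size=3)               # 1. initialise weights randomly
    alpha, batch_size = 0.1, 20

    for epoch in range(200):             # 2. loop until convergence (fixed epochs here)
        for start in range(0, len(X), batch_size):      # a) loop over batches
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            grad = 2 * xb.T @ (xb @ w - yb) / len(xb)   # i) MSE gradient on the batch
            w -= alpha * grad                            # ii) update weights
    ```
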
  • What is a challenge in optimising neural networks?

    Finding the lowest point on complex loss surfaces is difficult.

    optimisation neural_networks
  • Why is the learning rate important?

    The size of the learning rate significantly affects the training process.

    learning_rate importance
  • What happens if the learning rate is too low?

    Optimization can take a very long time to reach a good minimum.

    learning_rate optimization
  • What happens if the learning rate is too high?

    We can step over the correct solution.

    learning_rate optimization
  • What is the ideal state of the learning rate?

    It allows reaching the minimum of the loss function in a reasonable number of steps.

    learning_rate optimization
  • What is the learning rate?

    A hyperparameter that needs to be chosen based on the development set.

    learning_rate hyperparameter
  • What are adaptive learning rates?

    Different learning rates for each parameter in the model.

    adaptive_learning hyperparameter
  • What happens if a parameter has not been updated for a while?

    The learning rate for that parameter may be increased.

    adaptive_learning parameters
  • What happens if a parameter is making big updates?

    The learning rate for that parameter may be decreased.

    adaptive_learning parameters
  • What algorithms work well for adaptive learning rates?

    The 'Adam' and 'AdaDelta' algorithms.

    adaptive_learning algorithms
  • What is learning rate decay?

    Scaling the learning rate by a value between 0 and 1.

    learning_rate_decay hyperparameter
  • What is the intuition behind learning rate decay?

    Take smaller steps as we approach the minimum to avoid overshooting.

    learning_rate_decay optimization
  • When can learning rate decay be performed?

    Every epoch, after a certain number of epochs, or when validation performance doesn't improve.

    learning_rate_decay strategies
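  • A minimal sketch of epoch-based decay, assuming a fixed decay factor between 0 and 1:

    ```python
    def decayed_lr(initial_lr, decay_rate, epoch, decay_every=1):
        """Scale the learning rate by decay_rate (between 0 and 1)
        once every `decay_every` epochs."""
        return initial_lr * decay_rate ** (epoch // decay_every)
    ```
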
  • What is the simplest approach to weight initialization?

    Setting weights to zeros.

    weight_initialization neural_networks
  • Why should we not set all weights to zero?

    Neurons in the same layer receive identical updates and learn the same things, ending up with the same optimised values (the symmetry is never broken).

    weight_initialization neural_networks
  • What is a common method for weight initialization?

    Drawing randomly from a normal distribution with mean 0 and variance 1 or 0.1.

    weight_initialization normal_distribution
  • What does Xavier Glorot initialization do?

    Draws values from a uniform distribution based on the number of neurons in layers.

    weight_initialization xavier_glorot
  • What is the formula used in Xavier Glorot initialization?

    Weights are drawn from a uniform distribution defined by boundaries involving the number of neurons.

    weight_initialization xavier_glorot
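  • A sketch using the commonly cited Glorot bound ±√(6 / (n_in + n_out)); the exact boundary formula used in the lecture may differ:

    ```python
    import numpy as np

    def glorot_uniform(n_in, n_out, rng=None):
        """Draw weights uniformly from [-limit, limit],
        with limit = sqrt(6 / (n_in + n_out))."""
        rng = rng or np.random.default_rng()
        limit = np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-limit, limit, size=(n_in, n_out))
    ```
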
  • What role does randomness play in neural networks?

    It is important for various aspects of the learning process.

    randomness neural_networks
  • What role does randomness play in neural networks?

    Different random initialisations lead to different results and performance.

    neural_networks randomness
  • What is the solution to controlling randomness in neural networks?

    Explicitly set the random seed for all random number generators used.

    neural_networks randomness
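  • A minimal sketch for Python/numpy; each framework in use needs its own seeding call as well:

    ```python
    import random
    import numpy as np

    def set_seed(seed):
        """Explicitly seed all random number generators used."""
        random.seed(seed)        # Python's built-in generator
        np.random.seed(seed)     # numpy's legacy global generator
        # frameworks need their own call, e.g. torch.manual_seed(seed) if used
    ```
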
  • What can happen when processes are parallelised on GPUs?

    They can produce randomly different results due to different threads running at different times.

    neural_networks gpus
  • How should you report model performance under different random seeds?

    Report the mean and standard deviation of the performance.

    neural_networks performance
  • What is min-max normalisation?

    Scaling the smallest value to a and the largest to b, e.g., [0, 1] or [-1, 1].

    normalisation data_processing
  • What is the formula for min-max normalisation?

    X′ = a + (X - Xmin)(b - a) / (Xmax - Xmin)

    normalisation formulas
  • What is standardisation (z-normalisation)?

    Scaling the data to have mean 0 and standard deviation 1.

    normalisation data_processing
  • What is the formula for standardisation?

    X′ = (X - μ) / σ

    normalisation formulas
  • Why is normalisation important in neural networks?

    It helps weight updates to be proportional to the input, improving model learning accuracy.

    normalisation neural_networks
  • What should you remember about normalisation for data columns?

    Normalise each column separately, not the entire matrix.

    normalisation data_processing
  • How should normalising constants be calculated?

    Calculate them based only on the training set and apply them to test/evaluation sets.

    normalisation data_processing
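  • A sketch of standardisation done correctly: constants fitted per column on the training set only, then applied unchanged to test/evaluation sets:

    ```python
    import numpy as np

    def fit_standardiser(X_train):
        """Compute per-column mean and std on the training set only."""
        return X_train.mean(axis=0), X_train.std(axis=0)

    def standardise(X, mu, sigma):
        """Apply X' = (X - mu) / sigma column-wise."""
        return (X - mu) / sigma
    ```
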
  • What is gradient checking?

    A method to verify if the gradient is calculated correctly in the implementation.

    gradient_checking neural_networks
  • What are the two methods to isolate the gradient?

    1. Check gradient using weight difference before and after gradient descent.
    2. Measure change in loss by altering the weight slightly.
    gradient_checking neural_networks
  • What is the formula for the gradient using weight difference?

    ∂L(w)/∂w = (w(t-1) - w(t)) / α

    gradient_checking formulas
  • What is the formula for measuring change in loss?

    ∂L(w)/∂w ≈ (L(w + ε) - L(w - ε)) / (2ε)

    gradient_checking formulas
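  • The central-difference check can be sketched as follows, using an illustrative loss L(w) = w² whose analytic gradient is 2w:

    ```python
    def numerical_grad(loss_fn, w, eps=1e-5):
        """Central-difference estimate: (L(w + eps) - L(w - eps)) / (2 * eps)."""
        return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

    loss = lambda w: w ** 2            # illustrative loss; analytic gradient is 2w
    approx = numerical_grad(loss, 3.0) # should be close to 6.0
    ```
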
  • What is the definition of a partial derivative?

    The partial derivative of L(w) with respect to w is defined as: \( \frac{\partial L(w)}{\partial w} = \lim_{\epsilon \to 0} \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon} \)

    calculus derivative
  • What indicates a bug in neural network training?

    If the values from different methods of calculating partial derivatives are not similar, it indicates a bug.

    neural_networks debugging
  • What is overfitting in neural networks?

    Overfitting occurs when a model learns the training data too well, failing to generalize to unseen data.

    neural_networks overfitting
  • How can overfitting be prevented?

    To prevent overfitting, use held-out validation and test sets to measure generalization performance.

    neural_networks overfitting validation
  • What is network capacity?

    Network capacity refers to the number of parameters in a model and its ability to overfit the dataset.

    neural_networks capacity
  • What does it mean if a model is underfitting?

    Underfitting means the model performs poorly on both training and validation sets due to insufficient capacity.

    neural_networks underfitting
  • How can you improve a model that is underfitting?

    Increase the number of neurons, parameters, or layers in the model to improve learning.

    neural_networks underfitting
  • What indicates a model is overfitting?

    Overfitting is indicated by good performance on the training set but poor performance on the validation set.

    neural_networks overfitting
  • What is one method to prevent overfitting?

    Limit the number of parameters in the model to prevent memorization of the dataset.

    neural_networks overfitting prevention
  • What is the best solution to overfitting?

    The best solution to overfitting is to acquire more data for training.

    neural_networks overfitting data
  • What is early stopping in neural network training?

    Early stopping is a method where training is halted when performance on the validation set does not improve for a set number of epochs.

    neural_networks early_stopping
  • What is regularization in the context of neural networks?

    Regularization adds constraints to the model to prevent overfitting, such as penalizing large weights.

    neural_networks regularization
  • What are L2 and L1 regularization?

    L2 regularization adds squared weights to the loss function, while L1 regularization adds absolute weights, both helping to control model complexity.

    neural_networks regularization l2 l1
  • What does L2 regularization do to weights?

    L2 regularization penalizes larger weights more, encouraging sharing between features and pushing weights towards 0.

    neural_networks regularization l2
  • What does L2 regularisation do?

    Adds squared weights to the loss function, penalising larger weights more and encouraging sharing between features.

    regularisation l2
  • What is the formula for L2 regularisation loss function?

    The formula is: J(θ) = Loss(y, ŷ) + λ ∑ w²

    formula l2
  • How does L2 regularisation affect weight updates?

    The update rule is: w ← w − α(∂Loss/∂w + 2λw)

    weight_update l2
  • What is the role of the hyperparameter λ in L2 regularisation?

    Controls the importance of regularisation, usually set to a low value (e.g., 0.001).

    hyperparameter l2
  • What does L1 regularisation do?

    Adds the absolute value of weights to the loss function, using the sign of the weight for updates.

    regularisation l1
  • What is the formula for L1 regularisation loss function?

    The formula is: J(θ) = Loss(y, ŷ) + λ ∑ |w|

    formula l1
  • How does L1 regularisation affect weight updates?

    The update rule is: w ← w − α(∂Loss/∂w + λ sign(w))

    weight_update l1
  • How do L1 and L2 regularisation differ in weight management?

    L2 pushes all weights towards 0, while L1 encourages sparsity, keeping many weights at 0.

    comparison regularisation
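  • Both update rules can be sketched directly from the formulas above:

    ```python
    import numpy as np

    def l2_update(w, grad, alpha, lam):
        """w <- w - alpha * (dLoss/dw + 2*lambda*w)"""
        return w - alpha * (grad + 2 * lam * w)

    def l1_update(w, grad, alpha, lam):
        """w <- w - alpha * (dLoss/dw + lambda * sign(w))"""
        return w - alpha * (grad + lam * np.sign(w))
    ```
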
  • What is dropout in neural networks?

    A method to reduce overfitting by randomly setting some neural activations to 0 during training.

    dropout overfitting
  • What percentage of neurons are typically dropped during training with dropout?

    About 50% of neurons are typically dropped at each backward pass.

    dropout neural_networks
  • What happens during testing when using dropout?

    All neurons are used, but activations are scaled so their expected magnitude matches what the network saw during training.

    dropout testing
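  • A sketch of "inverted" dropout, a common equivalent variant that scales the surviving activations by 1/(1 − p) during training so no rescaling is needed at test time (the notes describe the variant that scales at test time instead):

    ```python
    import numpy as np

    def dropout_forward(a, p=0.5, training=True, rng=None):
        """Inverted dropout: during training, zero activations with probability p
        and scale survivors by 1/(1-p); at test time, use all neurons unchanged."""
        if not training:
            return a
        rng = rng or np.random.default_rng()
        mask = (rng.random(a.shape) >= p) / (1.0 - p)
        return a * mask
    ```
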
  • What is the difference between supervised and unsupervised learning?

    Supervised learning uses labeled data, while unsupervised learning uses only feature values without labels.

    learning supervised unsupervised
  • What is the objective of unsupervised learning?

    To find hidden structures in the dataset without ground-truth labels.

    unsupervised objective
  • What is unsupervised learning?

    A type of learning where the dataset consists only of feature values without ground-truth labels.

    machine_learning unsupervised_learning
  • What is the objective of unsupervised learning?

    To find hidden structures in the dataset for making inferences or decisions.

    machine_learning objectives
  • What is clustering in unsupervised learning?

    The task of finding groups ('clusters') of samples that might belong to the same class.

    machine_learning clustering
  • What is density estimation?

    Finding the probability of seeing a point in a certain location compared to another location.

    machine_learning density_estimation
  • What is dimensionality reduction?

    A process to reduce the number of features while retaining important information.

    machine_learning dimensionality_reduction
  • Name a famous algorithm for dimensionality reduction.

    Principal Component Analysis (PCA).

    machine_learning algorithms
  • What does clustering imply about intra-cluster variance?

    There is low intra-cluster variance among instances in the same cluster.

    machine_learning clustering
  • What is the k-means algorithm used for?

    To identify a specified number of clusters in a dataset.

    machine_learning k-means
  • What are the steps of the k-means algorithm?

    Initialisation, Assignment, Update, and checking for convergence.

    machine_learning k-means process
  • What is a cluster in clustering?

    A set of instances that are similar to each other and dissimilar to instances in other clusters.

    machine_learning clustering
  • How does clustering help in vector quantization?

    It improves encoding by clustering information in a datastream to reduce data size.

    machine_learning vector_quantization
  • What is an example of using clustering in nature?

    Identifying different species of flowers by plotting features like petal length vs. sepal width.

    machine_learning nature clustering
  • What is the structure of an unsupervised learning task?

    A feature space with datapoints lacking additional information like labels or values.

    machine_learning unsupervised_learning
  • What does k represent in k-means clustering?

    The number of clusters, e.g., k = 3 means there are 3 centroids.

    k_means clustering
  • What is the first step in the k-means algorithm?

    Initialisation: Select k random instances or generate random vectors for centroids.

    k_means initialisation
  • What is the goal of the assignment step in k-means?

    Assign every point in the dataset to the nearest centroid.

    k_means assignment
  • How do we update centroids in k-means?

    By computing the average position of all points in each cluster.

    k_means update
  • What is checked during the convergence step in k-means?

    The displacement of centroids; if it's larger than a threshold, loop back to assignment.

    k_means convergence
  • What are Voronoi diagrams?

    Diagrams that create decision boundaries equidistant between centroids.

    geometry voronoi
  • What is the formula for the assignment step in k-means?

    ∀i ∈ {1,…,N}: c(i) = argmin_{k ∈ {1,…,K}} ‖x(i) − μ_k‖².

    k_means assignment_formula
  • What does the update formula in k-means compute?

    The average location for all samples assigned to cluster k.

    k_means update_formula
  • What condition indicates convergence in k-means?

    If ∀k: ‖μ_k^t − μ_k^(t−1)‖ < ε.

    k_means convergence_condition
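  • The four steps (initialisation, assignment, update, convergence check) can be sketched as follows — a naive implementation with an extra guard for empty clusters:

    ```python
    import numpy as np

    def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # Initialisation: select k random instances as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iters):
            # Assignment: every point goes to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update: each centroid moves to the mean of its assigned points
            new_centroids = centroids.copy()
            for j in range(k):
                members = X[labels == j]
                if len(members):          # keep the old centroid if a cluster is empty
                    new_centroids[j] = members.mean(axis=0)
            # Convergence: stop when total centroid displacement is below the threshold
            if np.linalg.norm(new_centroids - centroids) < tol:
                break
            centroids = new_centroids
        return centroids, labels
    ```
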
  • What is checked in Step 4 of K-means?

    Convergence by computing the movement of centroids between timesteps.

    k-means convergence
  • What indicates to stop iterating in K-means?

    If the movement of centroids is lower than a certain threshold (𝜖).

    k-means iteration
  • How is K-means viewed as a model?

    As a model optimization problem with centroid locations and data point assignments.

    k-means model
  • What is the objective of K-means?

    Minimize the loss function L for assignments of data points to centroids.

    k-means objective
  • What does the loss function L represent?

    The mean distance between samples and their associated centroid.

    k-means loss_function
  • What is the significance of K in K-means?

    K is a crucial hyperparameter that affects the clustering results.

    k-means hyperparameter
  • What is the Elbow Method used for?

    To determine the optimal value of K by plotting loss values against K.

    k-means elbow_method
  • What should be selected according to the Elbow Method?

    The value of K where the rate of decrease in loss sharply shifts.

    k-means elbow_method
  • What does cross-validation help determine?

    The best value for hyperparameters using a validation set.

    k-means cross-validation
  • What are the strengths of K-means?

    Simple, popular, and efficient with linear complexity.

    k-means strengths
  • What is a significant weakness of K-means?

    The need to define K, which significantly impacts results.

    k-means weaknesses
  • What is a significant hyperparameter in K-means?

    K (the number of clusters)

    k-means hyperparameter
  • What is a weakness of K-means regarding its results?

    It only finds a local optimum and is sensitive to initial centroid positions.

    k-means weaknesses
  • What technique can improve K-means initialization?

    K-means++

    k-means initialization
  • When is K-means applicable?

    When a distance function exists on the dataset, typically with real values.

    k-means applicability
  • What algorithm works with categorical data in clustering?

    K-mode algorithm

    k-mode categorical
  • How does the K-medoid algorithm differ from K-means?

    It is less sensitive to outliers by using the geometric median.

    k-medoid outliers
  • What shape must clusters have for K-means to work effectively?

    Clusters must be hyper-ellipsoids (or hyper-spheres).

    k-means cluster_shapes
  • What is the objective of density estimation algorithms?

    To estimate the probability density function p(x) from data.

    density_estimation pdf
  • What does a Probability Density Function (PDF) model?

    The likelihood of a continuous variable being observed within an interval.

    pdf probability
  • What must the integral of a PDF over its range equal?

    1

    pdf integral
  • What is one application of density estimation?

    Anomaly/novelty detection.

    density_estimation applications
  • What is the goal of generative models in relation to probability?

    To model the distribution of a class as p(X | y).

    generative_models probability
  • What do discriminative models directly model?

    The probability of observing label y given sample values X, p(y | X).

    discriminative_models probability
  • What activation function transforms neural network output into a probability distribution?

    Softmax activation.

    neural_networks softmax
  • What does the Softmax activation do?

    Transforms the output of the neural network into a probability distribution.

    neural_networks activation_functions
  • What is Bayes’ rule used for in generative models?

    To turn the generative model into a discriminative classifier.

    bayes classification
  • What is the formula for Bayes’ rule?

    \( p(y | X) = \frac{p(X | y)p(y)}{p(X)} \)

    bayes formula
  • What do non-parametric methods assume about function shape?

    They make no assumptions about the form/shape of the function.

    non-parametric methods
  • What is an example of a non-parametric method?

    k-NN algorithm.

    k-nn non-parametric
  • What is the bias and variance characteristic of non-parametric methods?

    Low bias; high variance depending on the data.

    bias variance
  • What do histograms do in density estimation?

    Group data into bins, count occurrences, and normalize.

    density_estimation histograms
  • What does normalization ensure in histograms?

    The integral of the function sums to 1, making it a valid PDF.

    normalization pdf
  • What is Kernel Density Estimation?

    Estimates the density of a function by using a kernel around training examples.

    kernel_density_estimation density_estimation
  • What does the kernel function do in density estimation?

    Computes the difference with the current point x and normalizes according to bandwidth.

    kernel_function density_estimation
  • What is a Parzen window?

    A method used in kernel density estimation to define the kernel.

    parzen_window kernel
  • What type of distribution can be used as a kernel in density estimation?

    Gaussian distribution.

    gaussian kernel
  • What are the characteristics of parametric approaches?

    Make assumptions about the shape, inducing bias but fixing the number of parameters.

    parametric bias
  • What is the univariate Gaussian distribution parameterized by?

    Mean (μ) and variance (σ²).

    univariate gaussian
  • What is ensured by the normalization factor in Gaussian distribution?

    The integral of the distribution sums to 1.

    normalization gaussian
  • What does the multivariate Gaussian distribution take as input?

    A multi-dimensional vector.

    multivariate gaussian
  • What is the input of the Multivariate Gaussian Distribution?

    A multi-dimensional vector.

    statistics gaussian
  • What replaces variance in the Multivariate Gaussian Distribution?

    The covariance matrix Σ.

    statistics gaussian
  • What is the purpose of the normalization term in the Multivariate Gaussian Distribution?

    To ensure the double-integral sums to 1.

    statistics normalization
  • What does likelihood determine in a model?

    How good the model is at capturing the probability of generating data x.

    statistics likelihood
  • What assumption is made about the datapoints in the training set?

    They follow i.i.d distributions.

    statistics data
  • What do we multiply to get the likelihood in a dataset?

    The predicted values from the models for every sample with parameters θ.

    statistics likelihood
  • Why do we calculate negative log-likelihood instead of likelihood?

    To turn maximization into minimization, similar to training a neural network.

    statistics optimization
  • What does Gaussian fitting minimize?

    The negative log likelihood.

    statistics gaussian fitting
  • What happens when you take the log of a multiplication term?

    Multiplications turn into sums.

    mathematics logarithm
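  • A worked sketch: the negative log likelihood of a univariate Gaussian, where the log has turned the product over samples into a sum; the sample mean and variance are the closed-form minimisers:

    ```python
    import numpy as np

    def gaussian_nll(x, mu, sigma2):
        """Negative log likelihood of data x under N(mu, sigma2).
        The log turns the product over i.i.d. samples into a sum."""
        return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

    x = np.array([1.0, 2.0, 3.0])
    mu_hat, var_hat = x.mean(), x.var()   # closed-form maximum-likelihood estimates
    ```
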
  • Is the Gaussian distribution sufficient for modeling densities in all cases?

    No, it may not be satisfactory for all data distributions.

    statistics gaussian
  • What is the problem with fitting a Gaussian distribution to bimodal data?

    It induces bias and may not capture the data's characteristics.

    statistics bias
  • What is a potential solution to the limitations of Gaussian distributions?

    Using mixture models to capture different modes of the distribution.

    statistics mixture_models
  • How is the PDF of mixture models defined?

    As the weighted sum of multiple PDFs: p(x) = ∑_k π_k p_k(x).

    statistics mixture_models
  • What constraints does the mixing proportion 𝜋𝑘 follow?

    0 ≤ π_k ≤ 1 and ∑_k π_k = 1.

    statistics constraints
  • What does the Gaussian Mixture Model (GMM) estimate?

    The probability density with p(x) from multiple Gaussian distributions.

    statistics gmm
  • What is the Gaussian Mixture Model a weighted sum of?

    Gaussians, ensuring the PDF integrates to 1.

    statistics gmm density_estimation
  • What is a Gaussian Mixture Model (GMM)?

    A GMM is a weighted sum of Gaussians.

    gmm statistics
  • What does GMM ensure about the PDF?

    The GMM ensures that the PDF integrates to 1, even if it is a mixture of multiple PDFs.

    pdf gmm
  • What is the purpose of GMMs?

    GMMs can model complicated data, including multi-modal data.

    modeling data gmm
  • What algorithm is used to fit GMM to training examples?

    The Expectation Maximisation (EM) algorithm is used.

    em algorithm gmm
  • What are the two main steps of the EM algorithm?

    The two main steps are the E-step (expectation) and the M-step (maximisation).

    em steps
  • What is done in the E-step of the EM algorithm?

    Responsibilities for each training example and each mixture component are computed.

    e-step em
  • How is the responsibility calculated in the E-step?

    Using the formula: \( r_{ik} = \frac{\pi_k \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x^{(i)} | \mu_j, \Sigma_j)} \)

    responsibility e-step
  • What is updated in the M-step of the EM algorithm?

    GMM parameters are updated using the computed responsibilities.

    m-step em
  • How is the mean updated in the M-step?

    The mean is updated using: \( \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} r_{ik} x^{(i)} \)

    mean m-step
  • What is checked for convergence in the EM algorithm?

    Convergence is checked by monitoring changes in parameters or log likelihood.

    convergence em
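  • A sketch of one EM round for a GMM: the E-step responsibilities plus the mean update of the M-step (the covariance and mixing-weight updates are omitted for brevity, and the Gaussian density is a naive numpy implementation):

    ```python
    import numpy as np

    def gauss_pdf(X, mu, sigma):
        """Multivariate normal density, naive numpy implementation."""
        d = X.shape[1]
        diff = X - mu
        inv = np.linalg.inv(sigma)
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
        return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))

    def e_step(X, pis, mus, sigmas):
        """r_ik = pi_k N(x_i|mu_k,Sigma_k) / sum_j pi_j N(x_i|mu_j,Sigma_j)."""
        r = np.column_stack([pis[k] * gauss_pdf(X, mus[k], sigmas[k])
                             for k in range(len(pis))])
        return r / r.sum(axis=1, keepdims=True)

    def m_step_means(X, r):
        """mu_k = (1/N_k) sum_i r_ik x_i, with N_k = sum_i r_ik."""
        return (r.T @ X) / r.sum(axis=0)[:, None]
    ```
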
  • What is the Bayesian Information Criterion (BIC)?

    BIC is used to select the number of components K in GMM.

    bic gmm
  • What is the formula for BIC?

    \( BIC_k = \mathcal{L}(K) + \frac{P_k}{2} \log(N) \)

    bic formula
  • What does \( \mathcal{L}(K) \) represent in the BIC formula?

    \( \mathcal{L}(K) \) is the negative log likelihood.

    bic log_likelihood
  • What is the penalty term in the BIC formula?

    The penalty term is \( \frac{P_k}{2} \log(N) \), which penalizes complex models.

    bic penalty
  • What does N represent in the BIC formula?

    N is the number of examples in the dataset.

    bic n
  • What is the formula for BIC_K?

    BIC_K = ℒ(K) + (P_K / 2) log(N)

    formula statistics
  • What does ℒ(K) represent?

    ℒ(K) is the negative log likelihood encouraging fitting of data.

    statistics likelihood
  • What does (P_K / 2) log(N) represent?

    It is the penalty term that penalizes complex models.

    statistics penalty
  • What does N represent in the context?

    N is the number of examples in the dataset.

    data statistics
  • What does Pk represent?

    Pk is the number of parameters.

    parameters statistics
  • How many parameters does a GMM with K 2D Gaussian components have?

    Pk = 6K - 1.

    gaussian parameters
  • What are the parameters for the mean in 2D Gaussian?

    2 parameters for the mean (2D vector).

    gaussian mean
  • How many parameters are needed for covariance in 2D Gaussian?

    3 parameters for the covariance (symmetric 2x2 matrix).

    gaussian covariance
  • What is the purpose of the -1 in the parameter count?

    It accounts for the constraint that the sum of mixing proportions must equal 1.

    parameters constraints
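  • The parameter count and the BIC formula can be sketched together (the 6K − 1 count is specific to 2D components):

    ```python
    import math

    def gmm_2d_param_count(K):
        """Per 2D component: 2 (mean) + 3 (symmetric 2x2 covariance) + 1 (mixing
        weight), minus 1 overall because the mixing proportions must sum to 1."""
        return 6 * K - 1

    def bic(neg_log_lik, K, N):
        """BIC_K = L(K) + (P_K / 2) * log(N)."""
        return neg_log_lik + gmm_2d_param_count(K) / 2 * math.log(N)
    ```
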
  • What principle is suggested for model selection?

    Occam’s Razor: pick the simplest model that fits.

    modeling occam'srazor
  • What happens to BIC values as K increases?

    BIC values decrease sharply then rise again due to penalty dominance.

    bic model_selection
  • What is cross-validation used for?

    To find the most appropriate number of components K.

    cross-validation model_selection
  • What are the steps in cross-validation for GMM-EM?

    1. Split into training/validation sets. 2. Run GMM-EM with different K's. 3. Pick K with best likelihood.
    cross-validation gmm-em
  • What is a key similarity between GMM and K-means?

    Both require selecting the most appropriate K value for clusters/components.

    gmm k-means
  • What does convergence mean in GMM and K-means?

    Convergence occurs when changes in parameters are sufficiently small.

    convergence algorithms
  • How does GMM initialization relate to K-means?

    GMM means are often initialized from K-means centroid locations.

    gmm initialization
  • What is soft clustering in GMM?

    Every point belongs to several clusters with varying degrees of membership.

    clustering gmm
  • What distance metric is used in GMM?

    Distance is related to Mahalanobis distance, encoded by the covariance matrix.

    distance gmm
  • What is the focus of Module 7?

    Evolutionary algorithms, including genetic algorithms.

    algorithms evolutionary
  • What is the purpose of genetic/evolutionary algorithms?

    Optimization for black box functions.

    optimization algorithms
  • What is a Genetic/Evolutionary Algorithm?

    An optimisation method for black box functions without knowing the mathematical equation or gradient, inspired by natural evolution and genetics.

    algorithms genetic optimization
  • What is reinforcement learning?

    Learning to maximize a numerical reward, considered an optimization problem.

    reinforcement learning optimization
  • What do traditional RL algorithms deal with?

    Discrete states and action spaces.

    reinforcement learning states
  • What do policy search algorithms deal with?

    Continuous search spaces, represented as x* = argmax_x f(x).

    policy search algorithms
  • What are Black-Box Optimisation Algorithms?

    Algorithms for problems where the relationship between the parameters and the performance is unknown at the start of training.

    black-box optimization algorithms
  • What is an example of black-box optimisation in robotics?

    The unknown relationship between speed and joint movements.

    robotics black-box optimization
  • What year did Darwin publish his theory about the origin of species?

    1859.

    history darwin theory
  • What are the four main concepts of Darwin's theory?

    1. Variation is heritable. 2. Resources are finite. 3. Natural selection. 4. Survivors pass traits to offspring.
    darwin theory concepts
  • Who discovered principles of statistical inheritance?

    Mendel in 1866.

    mendel inheritance genetics
  • What did Weismann discover in 1883?

    Acquired traits are not passed to offspring.

    weismann inheritance genetics
  • What did Watson, Crick, and Franklin discover in 1953?

    The structure of DNA.

    dna discovery structure
  • What is a gene?

    A sequence of nucleotides in DNA that codes a particular trait.

    gene dna trait
  • What is a genotype?

    A set of genes (parameters).

    genotype genes parameters
  • What is a phenotype?

    The physiological expression of the genotype.

    phenotype genotype expression
  • What are the three main families of evolutionary algorithms proposed in the 60s?

    1. Evolutionary Strategies. 2. Evolutionary Programming. 3. Genetic Algorithms.
    algorithms evolutionary families
  • What is Genetic Programming?

    The evolution of programs, defining and computing a program as a tree.

    genetic programming evolution
  • What is the main concept of genetic/evolutionary algorithms?

    They have a population of solutions encoding genotypes, which are developed into phenotypes for evaluation.

    genetic algorithms concept
  • What happens to the worst-performing functions in genetic algorithms?

    They are removed (killed), and crossover and mutation are applied to the better-performing ones.

    genetic algorithms performance
  • What is observed in the black box function?

    The output helps to rank the phenotypes.

    algorithms output
  • What happens to the worst functions in the process?

    They can be removed (killed).

    algorithms selection
  • What is done to better performing functions?

    Cross-over and mutation are applied to generate new solutions (offspring).

    algorithms mutation
  • What is the result of repeating the evolutionary process?

    The solution converges to an optimal high-performing solution.

    algorithms convergence
  • What principle do these algorithms use as a base?

    A simplified version of Neo-Darwinism.

    algorithms principle
  • How is each solution represented in evolutionary algorithms?

    Each solution is represented by a genotype.

    algorithms genotype
  • What function measures the performance of phenotypes?

    A fitness function is used.

    algorithms fitness
  • What is the selection operator?

    It selects the solutions that will be reproduced.

    algorithms selection
  • What does the cross-over operator do?

    It mixes the parents’ genotype to create the offspring.

    algorithms crossover
  • What is the mutation operator?

    It applies variations to the genotype after reproduction.

    algorithms mutation
  • What is the genotype in a genetic algorithm?

    A binary string of fixed size (e.g., 01001010).

    algorithms genotype
  • What is the genotype in genetic programming?

    A program represented as a tree (often in LISP).

    algorithms genotype
  • What does the mutation in evolutionary strategies draw from?

    It draws from a Gaussian distribution.

    algorithms mutation
  • What term is used to describe the blurred lines between algorithm families?

    Evolutionary Algorithms.

    algorithms unification
  • What is the goal of the Mastermind game?

    Finding the secret combination of colors.

    games mastermind
  • How many colors can each piece have in Mastermind?

    Each piece can have 6 different colors.

    games mastermind
  • What is the fitness function for Mastermind?

    F(x) = p1 + 0.5*p2.

    games fitness
  • What does p1 represent in the Mastermind fitness function?

    The number of pieces with the right color and correct position.

    games fitness
  • What does p2 represent in the Mastermind fitness function?

    The number of pieces with the right color but wrong position.

    games fitness
  • What does p2 represent in the context of fitness functions?

    Number of pieces with the right colour but the wrong position.

    fitness p2
  • What is the formula for the fitness function F(x)?

    F(x) = p1 + 0.5*p2

    fitness formula
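The Mastermind fitness F(x) = p1 + 0.5*p2 can be sketched in a few lines of Python (the function name and the list encoding of a guess are my own, not from the course):

```python
def mastermind_fitness(guess, secret):
    """F(x) = p1 + 0.5*p2 for a Mastermind guess.

    p1: pieces with the right colour in the right position.
    p2: pieces with the right colour but in the wrong position.
    """
    # p1: exact matches (right colour, right position)
    p1 = sum(g == s for g, s in zip(guess, secret))
    # colour matches regardless of position, then subtract the exact ones
    colours = set(guess) | set(secret)
    common = sum(min(guess.count(c), secret.count(c)) for c in colours)
    p2 = common - p1
    return p1 + 0.5 * p2
```

With 4 pieces, a fully correct guess reaches the optimal fitness F(x) = 4 mentioned above.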
  • What is the goal of evolutionary algorithms regarding fitness functions?

    Maximize the fitness function.

    evolutionary algorithms
  • What is F(x) for solving the problem in this context?

    F(x) = 4

    fitness problem
  • What is the fitness function for teaching a robot to walk?

    F(x) = walking speed = travelled distance after a few seconds.

    robotics fitness
  • What is the fitness function for teaching a robot to throw an object?

    F(x) = distance(object, target).

    robotics fitness
  • What do genotype and phenotype represent in problem-solving?

    Potential solutions to the problem.

    genotype phenotype
  • What is the genotype for the Mastermind game?

    Binary string with N*3 bits.

    mastermind genotype
  • How is the phenotype created from the genotype in the Mastermind game?

    Aggregate bits 3 by 3, each trio becomes an integer.

    mastermind phenotype
  • What do integers correspond to in the Mastermind game?

    Different colours: (0=red, 1=yellow, 2=green, 3=blue…).

    mastermind colours
  • What is done with invalid genotypes in the Mastermind game?

    Assigned the lowest fitness value to reduce survival chance.

    mastermind fitness
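The genotype-to-phenotype decoding described above can be sketched as follows (a minimal illustration; the function name and the `None` convention for invalid genotypes are my own):

```python
def decode_genotype(bits, n_colours=6):
    """Decode a binary-string genotype into a Mastermind phenotype.

    Bits are aggregated 3 by 3; each trio becomes an integer
    (0=red, 1=yellow, 2=green, 3=blue, ...). With 6 colours the
    values 6 and 7 are invalid; such genotypes get the lowest
    fitness so they have a reduced chance of survival.
    Returns None for an invalid genotype.
    """
    assert len(bits) % 3 == 0, "genotype length must be a multiple of 3"
    phenotype = []
    for i in range(0, len(bits), 3):
        colour = int("".join(map(str, bits[i:i + 3])), 2)
        if colour >= n_colours:
            return None  # caller assigns the lowest fitness value
        phenotype.append(colour)
    return phenotype
```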
  • What is the purpose of selection operators in evolutionary algorithms?

    Select parents for the next generation.

    selection evolutionary
  • What is a standard approach for selection in evolutionary algorithms?

    Biased roulette wheel.

    selection roulette
  • How does the biased roulette wheel process work?

    Individuals are selected based on their fitness proportion.

    selection roulette
  • What is the first step in the biased roulette wheel process?

    Compute each individual's selection probability p_i, proportional to its fitness.

    selection roulette
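The biased roulette wheel can be sketched like this, assuming non-negative fitness values and p_i proportional to fitness (function name mine):

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Biased roulette wheel: pick one individual with
    probability p_i = f_i / sum(f)."""
    total = sum(fitnesses)
    r = rng.uniform(0, total)      # spin the wheel
    cumulative = 0.0
    for individual, f in zip(population, fitnesses):
        cumulative += f
        if r <= cumulative:
            return individual
    return population[-1]          # guard against float round-off
```

Fitter individuals occupy a larger slice of the wheel, so they are selected more often but never with certainty.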
  • What is the alternative to the roulette wheel selection method?

    Tournament selection.

    selection tournament
  • What is elitism in evolutionary algorithms?

    Keeping a fraction of the best individuals in the new generation.

    elitism evolutionary
  • What fraction is usually fixed for elitism?

    10%.

    elitism percentage
  • What is the role of the crossover operator?

    Combine traits of the parents.

    crossover operators
  • What is a common method for crossover?

    Single-point crossover.

    crossover method
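Single-point crossover on fixed-size genotypes can be sketched as below (a minimal version; the two-offspring convention is a common choice, not something the deck specifies):

```python
import random

def single_point_crossover(parent_a, parent_b, rng=random):
    """Single-point crossover: cut both parents at one random
    point and swap the tails to create two offspring."""
    assert len(parent_a) == len(parent_b)
    point = rng.randint(1, len(parent_a) - 1)  # cut inside the string
    child1 = parent_a[:point] + parent_b[point:]
    child2 = parent_b[:point] + parent_a[point:]
    return child1, child2
```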
  • What is the role of the mutation operator?

    Explore nearby solutions in the local solution space.

    mutation operators
  • How is standard mutation on binary strings performed?

    Randomly generate a number for each bit; if lower than probability m, mutate.

    mutation binary
  • What is the first step in standard mutation on binary strings?

    Randomly generate a number between 0 and 1 for each bit of the genotype.

    mutation binary
  • What happens if the generated number is lower than probability m?

    The bit is flipped.

    mutation binary
  • What is m typically set to in standard mutation?

    1/(size of the genotype).

    mutation probability
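Standard bit-flip mutation with the usual default m = 1/(size of the genotype) can be sketched as (function name mine):

```python
import random

def bitflip_mutation(genotype, m=None, rng=random):
    """Standard mutation on binary strings: for each bit, draw a
    number in [0, 1); if it is lower than m, flip the bit.
    Typical default: m = 1 / len(genotype)."""
    if m is None:
        m = 1.0 / len(genotype)
    return [1 - bit if rng.random() < m else bit for bit in genotype]
```

With the default m, one bit flips on average per mutation, which keeps the search local.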
  • What is the purpose of the specific mutation in the Mastermind problem?

    To swap groups of 3 bits in the genotype with probability m2.

    mastermind mutation
  • What is a common stopping criterion for evolutionary algorithms?

    When a specific fitness value is reached.

    stopping criteria
  • What fitness value indicates an optimal solution in the example?

    A fitness value of 4.

    fitness optimal
  • What is another stopping criterion besides reaching a fitness value?

    After a pre-defined number of generations/evaluations.

    stopping criteria
  • What is the first step in the evolutionary algorithm flowchart?

    Randomly generate the population.

    flowchart evolutionary
  • What do we do after evaluating the population in the evolutionary loop?

    Select individuals to keep for the next generation.

    selection evolutionary
  • What is elitism in the context of evolutionary algorithms?

    Keeping a few parents in the new population.

    elitism evolutionary
  • What is the function used to evaluate fitness in Mastermind?

    F(x) = p1 + 0.5*p2.

    fitness mastermind
  • What are evolutionary strategies designed to optimize?

    Real values in problems.

    evolutionary real_values
  • What is the main difference between genetic algorithms and evolutionary strategies?

    Genotype: genetic algorithms use binary strings, evolutionary strategies use real values.

    genetic evolutionary
  • What does the μ + λ evolutionary strategy represent?

    Maintains a steady population of μ + λ individuals.

    evolutionary strategy
  • What is the first step in the μ + λ evolutionary strategy?

    Randomly generate a population of (μ + λ) individuals.

    μ+λ evolutionary
  • What is the selection process in the μ + λ strategy?

    Select the μ best individuals from the population as parents.

    selection μ+λ
  • What is the first step in the evolutionary strategy process?

    Randomly generate a population of (μ + λ) individuals.

    evolution strategy
  • What do you do after generating the population?

    Evaluate the population.

    evaluation population
  • How many best individuals are selected as parents?

    Select the μ best individuals from the population as parents (called x).

    selection parents
  • What is generated from the parents in the evolutionary strategy?

    Generate λ offspring (called y) from the parents.

    offspring generation
  • What is the formula for generating offspring?

    For each offspring, use the formula y_i = x_j + N(0, σ), where j is a random parent among the μ selected.

    formula offspring
  • How is the population defined in the evolutionary strategy?

    Population = union of parents and offspring: population = (∪_{i=1..λ} y_i) ∪ (∪_{j=1..μ} x_j).

    population union
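One generation of the (μ + λ) strategy can be sketched as follows, assuming real-valued genotypes and a fitness function to maximise (the function name and list representation are my own):

```python
import random

def mu_plus_lambda_step(population, fitness, mu, lam, sigma, rng=random):
    """One generation of a (mu + lambda) evolution strategy.

    1. Keep the mu best individuals as parents x.
    2. Create lam offspring y_i = x_j + N(0, sigma), j random parent.
    3. Next population = parents united with offspring.
    """
    parents = sorted(population, key=fitness, reverse=True)[:mu]
    offspring = []
    for _ in range(lam):
        x = rng.choice(parents)
        offspring.append([v + rng.gauss(0.0, sigma) for v in x])
    return parents + offspring
```

Because the μ parents survive into the next population, the best fitness found so far can never decrease.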
  • What is the main challenge in evolutionary strategies?

    The main challenge lies in fixing the hyperparameter σ.

    challenge hyperparameter
  • What happens if 𝜎 is too large?

    If 𝜎 is too large, the population moves quickly to the solution but struggles to refine it.

    sigma population
  • What happens if 𝜎 is too small?

    If 𝜎 is too small, the population moves slowly and might be affected by local optima.

    sigma local_optima
  • How can 𝜎 be adjusted over time?

    Change 𝜎's value over time to adapt to the situation by adding sigma into the genotype.

    adaptation genotype
  • What is the new genotype defined as?

    Define an extended genotype x_j' = {x_j, σ_j}, composed of the initial genotype and its sigma value.

    genotype definition
  • How is the new offspring's sigma calculated?

    Calculate σ_i = σ_j · exp(τ0 · N(0, 1)).

    sigma calculation
  • What does the learning rate depend on?

    The learning rate τ0 is proportional to 1/√n, where n is the number of dimensions of the genotype.

    learning_rate dimensions
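The self-adaptation rule above can be sketched like this, with the step size σ carried inside the genotype (function name and tuple encoding mine):

```python
import math
import random

def self_adaptive_offspring(parent, rng=random):
    """Self-adaptation: the extended genotype x' = {x, sigma}
    carries its own mutation step size. The offspring's sigma is
    perturbed first, sigma_i = sigma_j * exp(tau0 * N(0, 1)) with
    tau0 proportional to 1/sqrt(n), then the new sigma mutates
    the solution values."""
    x, sigma = parent
    tau0 = 1.0 / math.sqrt(len(x))           # learning rate
    new_sigma = sigma * math.exp(tau0 * rng.gauss(0.0, 1.0))
    new_x = [v + rng.gauss(0.0, new_sigma) for v in x]
    return (new_x, new_sigma)
```

Note that the log-normal perturbation keeps σ strictly positive, which a plain additive update would not guarantee.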
  • Why is substituting 𝜎 with 𝜏0 beneficial?

    The choice of τ0 is less critical than a direct choice of σ, so it can be set more flexibly.

    substitution flexibility
  • What is a variant of evolutionary strategies?

    CMA-ES algorithm, which evolves a covariance matrix.

    cma-es variant
  • What is an approach to genetic algorithms?

    Discretise the parameters and use binary strings.

    genetic_algorithms discretisation
  • What is the goal of taking inspiration from natural evolution?

    To find effective solutions for survival and adaptation in environments.

    natural_evolution adaptation
  • What is the purpose of novelty search?

    To use novelty instead of fitness value to drive the search for optimality.

    novelty search
  • What does the novelty search algorithm focus on instead of fitness?

    Novelty value

    algorithm novelty
  • What is the purpose of the novelty archive?

    To store all encountered solutions for novelty calculation

    novelty archive
  • How is novelty calculated?

    By summing distances to the k nearest neighbors (k=3)

    novelty calculation
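The novelty computation can be sketched as below, assuming Euclidean distance between behavioural descriptors (a common choice; the deck only specifies k nearest neighbours with k = 3):

```python
def novelty(descriptor, archive, k=3):
    """Novelty of a behavioural descriptor: sum of the distances
    to its k nearest neighbours in the archive of encountered
    solutions. A larger novelty means the solution is more
    different from previous ones."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    distances = sorted(dist(descriptor, other) for other in archive)
    return sum(distances[:k])
```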
  • What does a larger novelty indicate?

    More difference from previous solutions

    novelty differences
  • What does the behavioral descriptor characterize?

    Aspects of solutions and distances between them

    behavior descriptor
  • Why is the behavioral descriptor task-specific?

    It defines features to compare based on the task

    task descriptor
  • What can happen if a feature is ignored in the behavioral descriptor?

    Loss of potentially useful information

    information descriptor
  • What is an example of a behavioral descriptor for a robot?

    (x, y) coordinates of the robot's final position

    robot descriptor
  • What problem can a fitness-focused algorithm encounter?

    Getting stuck in local minima

    algorithm fitness
  • How does novelty search differ from traditional evolutionary algorithms?

    It uses novelty score instead of fitness for evaluation

    novelty evaluation
  • What is the goal of Quality-Diversity Optimization?

    To learn diverse and high-performing solutions in one process

    optimization quality
  • What does the concept of Quality-Diversity Optimization apply to?

    Real-valued search space

    quality diversity
  • What is a potential benefit of novelty search for a bipedal robot?

    Leads to a more stable and successful robot

    robot stability
  • What is the goal of high-dimensional hyperspace exploration?

    To find points that lead to the most interesting solutions.

    hyperspace exploration
  • What does the concept of behavioural descriptors help generate?

    A collection of high-performing solutions with high diversity and performance.

    behavioural_descriptors solutions
  • How many degrees of freedom does the robot in the example have?

    12 degrees of freedom (2 in each leg).

    robot degrees_of_freedom
  • How many real-valued dimensions are there for the robot's movement?

    36 real-valued dimensions.

    dimensions robot_movement
  • What is the behavioural descriptor for the robot's movement?

    Proportion of time each leg touches the ground (6 dimensions).

    behavioural_descriptor robot
  • What is the goal of varying the proportions of time each leg spends touching the ground?

    To find an optimal solution for walking as fast as possible.

    robot walking optimization
  • How many ways to walk were found using the MAP-Elites algorithm?

    Over 13,000 ways to walk.

    map-elites walking
  • What are the two main focuses of Quality-Diversity (QD) algorithms?

    Measuring performance of solutions and distinguishing different types of solutions.

    qd_algorithms performance diversity
  • What is a fitness function used for in QD algorithms?

    To measure the performance of solutions.

    fitness_function qd_algorithms
  • What does the behavioural descriptor characterize in QD algorithms?

    It distinguishes different types of solutions.

    behavioural_descriptor qd_algorithms
  • What does Novelty Search with Local Competition optimize?

    Two fitness functions: novelty score and local competition.

    novelty_search local_competition
  • What is the concept of Local Competition in QD algorithms?

    Comparing new solutions only with similar ones in the same categories.

    local_competition comparison
  • What does LC(x) represent in Local Competition?

    Number of solutions that x outperforms within its k nearest neighbours.

    local_competition performance
  • What happens when a better version of a solution is found in the archive?

    The worse version is replaced by the better one.

    archive solution_replacement
  • What is the goal of MAP-Elites?

    To discretise the behavioural descriptor space in a grid and fill it with the best solutions.

    map-elites qd_algorithms
  • What does MAP-Elites stand for?

    Multi-Dimensional Archive of Phenotypic Elites.

    map-elites acronyms
  • What is the main advantage of MAP-Elites?

    Easy to implement and performs well in general.

    advantages map-elites
  • What is a disadvantage of MAP-Elites?

    Density of the solution is not always uniform.

    disadvantages map-elites
  • How does MAP-Elites add new solutions?

    If the cell is empty, the new solution is added; if occupied, the best fitness solution is kept.

    map-elites addition_mechanism
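The MAP-Elites addition mechanism can be sketched with a dictionary-based grid (a minimal illustration; the cell size is the grid-resolution hyper-parameter and its value here is arbitrary):

```python
def map_elites_add(grid, descriptor, solution, fit, cell_size=0.1):
    """MAP-Elites addition mechanism: discretise the behavioural
    descriptor into a grid cell. If the cell is empty, the new
    solution is added; if occupied, only the solution with the
    best fitness is kept."""
    cell = tuple(int(d // cell_size) for d in descriptor)
    if cell not in grid or fit > grid[cell][1]:
        grid[cell] = (solution, fit)
    return grid
```

Each cell thus stores at most one elite, so the archive size (number of filled cells) is a direct diversity metric.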
  • What is the hyper-parameter in MAP-Elites?

    Size of the cells (resolution of the grid).

    hyper-parameter map-elites
  • What is the first step in the MAP-Elites process?

    Randomly initialise some solutions to place in the grid.

    map-elites process
  • What happens during the mutation operator in MAP-Elites?

    Gaussian noise is added to some/all values of the selected solution.

    map-elites mutation
  • What is a common metric for diversity in MAP-Elites?

    Archive size (number of solutions stored in the collection).

    metrics map-elites
  • What does the QD-score represent?

    The sum of the fitness of all solutions in the archive.

    qd-score metrics
  • What is the trade-off in QD algorithms represented by?

    A Pareto-front to define the best variant of the algorithm.

    trade-off qd_algorithms
  • What is the usual metric for performance in MAP-Elites?

    Max or mean fitness value of all solutions.

    performance map-elites
  • What does the coverage refer to in MAP-Elites?

    Number of filled cells, number of individuals, or % of filled cells in the grid.

    coverage map-elites
  • What is the purpose of local competition in the algorithm?

    To explore many different solutions in the entire space.

    local_competition algorithm
  • What is the addition mechanism in MAP-Elites?

    It determines how new solutions are added to the grid based on their fitness.

    addition_mechanism map-elites
  • What is a general framework in QD algorithms?

    Allows use of different operators to define quality diversity algorithms for specific tasks.

    algorithms qd
  • What does the selector do in QD algorithms?

    Selects the individual to be mutated and evaluated in the next generation.

    selector qd
  • What is the simplest selection method used in MAP-Elites?

    Uniform random selection over the solutions in the container.

    selection map-elites
  • What are the criteria for proportional selection in QD?

    Fitness, novelty, curiosity score.

    selection criteria
  • How can solutions be stored in QD?

    Discretised grid (like MAP-Elites) or unstructured archive (like Novelty Search).

    storage qd
  • What is a key feature of the unstructured archive in QD?

    Maintains density instead of strict discretisation.

    archive density
  • What is the process for using advanced mutations in QD?

    Select multiple operators in stochastic selection, then apply cross-over before mutation.

    mutations cross-over
  • What is the QD algorithm for teaching a robot to walk?

    Unstructured archive + random uniform selector.

    robotics qd
  • What is the behavioral descriptor for the walking robot?

    X/Y coordinate position of the robot after 3 seconds.

    robotics behavior
  • What is the fitness measure for the walking robot?

    Angular error at the end of the trajectory w.r.t. an ideal circular trajectory.

    fitness robotics
  • What is the QD algorithm for teaching a robot to push a cube?

    MAP-Elites (grid + random uniform selector).

    robotics qd
  • What is the behavioral descriptor for the cube-pushing robot?

    Final position of the cube, where diversity is desired.

    robotics behavior
  • What is the fitness measure for the cube-pushing robot?

    Energy efficiency of the movement.

    fitness robotics
  • What are genetic algorithms, evolutionary strategies, and evolutionary algorithms based on?

    The same basic concepts.

    algorithms evolutionary