What is the definition of artificial intelligence according to Kurzweil, 1990?
The art of creating machines that perform functions that require intelligence when performed by people
What is Computational Intelligence according to Poole et al., 1998?
The study of the design of intelligent agents
What does Nilsson, 1998 say about AI?
AI is concerned with intelligent behaviour in artifacts
What is the focus of Charniak and McDermott, 1985 regarding AI?
The study of mental faculties through computational models
What is Winston, 1992's perspective on AI?
The study of computations that enable perception, reasoning, and acting
What is Haugeland, 1986's definition of AI?
The effort to make computers think like humans
What is Bellman, 1978's view on AI?
The automation of activities associated with human thinking
What approach does this course take towards AI?
The human route, programming computers to act humanly or learn from experience
What is a key application of ML mentioned?
Robotics
What is another application of ML?
Self-driving cars
What is an example of ML application in medicine?
Detecting sepsis in MRI scans
What is machine learning?
The field of machine learning is concerned with constructing computer programs that automatically improve with experience.
Who defined machine learning in 1997?
Tom Mitchell.
What is the focus of machine learning according to Tom Mitchell's 1997 definition?
A computer program is said to learn from experience E with respect to tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
What does the function 'f' calculate in the example?
Student's grades in Intro to ML.
What is 'h' in the example?
An approximation function used to estimate grades based on past data.
What are the three main categories of machine learning settings?
Supervised, Unsupervised, and Reinforcement.
What does supervised learning do?
Produces a model capable of generating correct output labels.
What is unsupervised learning?
No labels are given; algorithms find patterns in the data.
What is clustering in unsupervised learning?
Dividing data into groups based on similarities, like dogs and cats.
What is dimensionality reduction?
Identifying important features in data, like enhancing a blurry image of a face.
What is reinforcement learning?
An algorithm interacts with the environment to produce a reward signal for improvement.
What is policy search in reinforcement learning?
Finding actions for an agent to maximize received rewards based on its state.
What is semi-supervised learning?
Some data have labels, some do not; aims to label unlabelled data using labelled items.
What is weakly-supervised learning?
Inexact output labels; e.g., indicating an item is somewhere in an image without precise location.
What is classification in machine learning?
Assigning discrete or categorical variables to inputs, like predicting actions in videos.
What is binary classification?
A classification task with only 2 labels to choose from.
What is multi-class classification?
A classification task with more than 2 labels to choose from.
What is multi-label classification?
A classification task where multiple labels can be correct for a single input.
What is regression in machine learning?
Assigning a real/continuous float value to an input.
What is simple regression?
1 input variable and 1 output variable. E.g., size of a house predicts its price.
What is multiple regression?
Multiple input variables and 1 output variable. E.g., grade calculator with 3 inputs and 1 output (grade).
What is multivariate regression?
Multiple inputs to predict multiple outputs. E.g., predicting the location of an umbrella from a picture.
What is an example regression problem?
Given time as input, the regressor predicts the value at that time.
What characterizes a bad predictor in regression?
The line is far off from almost all points.
What characterizes a good predictor in regression?
The line is close to most points, even if it is off.
What characterizes a very good predictor in regression?
It predicts given points well but may struggle with unknown examples.
What is supervised learning?
Most common setting in ML problems, typically involves classification and regression.
How does Antoine classify shapes?
By placing data along 2 axes (colour and points) to create a classifier.
What is a linear classifier?
A classifier that uses a straight line to separate data into categories.
What have we learnt about data in predictions?
More data leads to more accurate predictions.
Why is selecting good features important?
Good features improve prediction accuracy; combining features is often better.
What are two ways to make predictions?
Classification (predicting a discrete label) and regression (predicting a continuous value).
What is the goal of generating a model in supervised learning?
To approximate the true function using input data to predict outputs.
What is the training dataset defined as?
A sequence of pairs of inputs and output labels (x_n, y_n).
What is feature encoding in supervised learning?
Transforming raw input observations into a modified version (feature space).
What is the purpose of the Xtest dataset?
To evaluate model performance on unseen data by comparing predicted outputs with ground truth.
What do we compute to measure model performance?
A score comparing predicted outputs with the ground truth/gold standard annotation.
What is the purpose of the truth/gold standard annotation?
To compute a score measuring model performance.
What is the first step in the complete pipeline?
Feature Encoding
Why is it important to examine data before designing an algorithm?
It can provide clues for classifier design and help identify class label distribution.
What happens if class labels are imbalanced?
The algorithm may learn to identify only the majority class.
What should you do with features before starting an algorithm?
Normalize your features.
How do you normalize features?
Subtract the mean and divide by the standard deviation.
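The normalisation step above can be sketched with NumPy — a minimal example, assuming the data is a 2-D array with one example per row and one feature per column:

```python
import numpy as np

def normalize(X):
    """Z-score normalisation: subtract the mean and divide by the
    standard deviation, computed per feature (column)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

# Illustrative data: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_norm = normalize(X)
# Each column of X_norm now has mean 0 and standard deviation 1.
```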
What is the curse of dimensionality?
As dimensions increase, data becomes sparse and training data may be noisy.
What is feature selection?
Choosing a subset of original features to work with.
What is feature extraction?
Generating a new set of features from the original features.
What is the Bag of Words method in NLP?
Logging the frequency of words without tracking their positions.
What is the modern approach to feature encoding in deep learning?
Letting the algorithm figure out optimal features from raw data.
What is a lazy learner?
Stores training examples and generalizes upon explicit request at test time.
What is an eager learner?
Constructs a general description of the target function before test time.
How does an eager learner differ from a lazy learner?
It learns and generalises all it can before test time, resulting in quicker predictions at test time.
What is a Non-Parametric Model?
Assumes no fixed form; trusts the data instead of a function.
What is an example of a Non-Parametric Model?
The nearest-neighbour classifier, which is a lazy learner.
How does a nearest neighbour classifier work?
Looks at the nearest neighbour and classifies itself as the same.
What is a Linear Model?
Assumes the data is linearly separable, learning the best line to separate it.
What does a Linear Model classify?
Anything on the left as a green diamond, anything on the right as a red circle.
What is a Non-Linear Model?
Used for non-linearly separable problems with more complex models.
What is Feature Space Transformation?
Representing data differently to analyze and separate it more easily.
How do SVMs solve non-linear datasets?
Use a kernel for transformation.
How do Neural Networks handle non-linear datasets?
Try to learn how to transform the feature space automatically.
What is the Bias-Variance trade-off?
A balance between overfitting (high variance) and underfitting (high bias).
What is Occam’s razor in ML?
Choose the simpler model if two models perform similarly.
What does MSE stand for?
Mean Squared Error, measures average square distance between correct and predicted outputs.
Is 85% accuracy good?
Accuracy is relative; depends on baseline and upper bound performance.
What is the Baseline in performance evaluation?
The lower bound for performance, often chance/random performance.
What is the Upper bound in performance evaluation?
The best case, often compared to human performance.
What is K-Nearest Neighbours?
A lazy learner that stores data until a request is made.
What are Decision Trees in ML?
Eager learners that process all data upfront and discard it after analysis.
What does the Nearest Neighbour Classifier do?
Classifies a test instance to the class label of the nearest training instance.
What does k-NN stand for?
k-nearest neighbours
What type of model is k-NN?
Non-parametric model
What is a major problem with k-NN?
Sensitive to noise
What is the solution to overfitting in k-NN?
Use k > 1 nearest neighbours and take a majority vote among them.
What does increasing k do to the classifier?
Makes the decision boundary smoother and less sensitive to training data
How should k be chosen in k-NN?
Using a validation dataset
What are some distance metrics used in k-NN?
Euclidean distance, Manhattan distance, Mahalanobis distance, Hamming distance.
What does distance-weighted k-NN do?
Assigns weights to neighbours based on their distance
What happens if k=N in weighted k-NN?
It becomes a global method
What is a disadvantage of k-NN for large datasets?
It can be slow
What is the curse of dimensionality in k-NN?
Distance metrics may not work well in high dimensional spaces
How does k-NN perform regression?
Computes the mean value across k nearest neighbours
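The k-NN behaviour described in these cards — majority vote for classification, mean over neighbours for regression — can be sketched as follows. The data and Euclidean distance metric are illustrative choices, not from the course:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, regression=False):
    """Minimal k-NN: find the k closest training points by Euclidean
    distance, then majority-vote (classification) or average (regression)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    labels = y_train[nearest]
    if regression:
        return labels.mean()          # mean of the k neighbours' values
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]  # majority vote

# Two well-separated clusters of 1-D points (made-up data):
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([1.5]), k=3)
# pred → 0 (all three nearest neighbours belong to class 0)
```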
What is the principle of decision trees?
Focus on a specific subset or feature to make decisions
What type of learners are decision trees?
Eager learners
What is decision tree learning?
A method for approximating discrete classification functions using a tree-based representation.
How can decision trees be represented?
As a set of if-then rules.
What type of search do decision tree learning algorithms use?
Top-down greedy search through the space of possible solutions.
Name some algorithms for constructing decision trees.
ID3, C4.5, CART.
What is the first step in the general decision tree algorithm?
Search for the optimal splitting rule on training data.
What is the goal of finding an optimal split rule?
To create partitioned datasets that are more 'pure' than the original dataset.
What does Information Gain measure?
The reduction of information entropy.
What does Gini Impurity measure?
The probability of incorrectly classifying a randomly picked point according to class label distribution.
What is Variance Reduction mainly used for?
Regression trees where the target variable is continuous.
Who introduced the concept of entropy in information theory?
Claude Shannon (1916-2001).
What does entropy measure?
The uncertainty of a random variable.
What is the formula for the amount of information required to determine the state of a random variable?
I(x) = log2(K).
How is the amount of information related to probability?
I(x) = -log2(P(x)).
What happens to information required when the impostor is more likely in one box?
Low entropy; less new information is gained.
What is the information required when the impostor is equally likely in 4 boxes?
I(x) = -log2(1/4) = 2 bits.
What does low entropy indicate?
You don’t need to know a lot of information to predict the value of a random variable.
What does high entropy indicate?
A lot of new information is gained when predicting the value of a random variable.
What is the entropy of box 1?
0.0439 bits (LOW entropy)
What is the entropy of box 2?
6.6439 bits (HIGH entropy)
How is entropy defined?
Average amount of information: H(X) = −∑_k P(x_k) log₂(P(x_k))
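The entropy formula can be checked with a small helper; the probability distributions below are illustrative:

```python
import math

def entropy(probs):
    """H(X) = -sum_k P(x_k) * log2(P(x_k)); terms with P = 0 contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A fair coin has maximal entropy for two outcomes:
print(entropy([0.5, 0.5]))   # 1.0 bit
# A certain outcome carries no new information:
print(entropy([1.0, 0.0]))   # 0.0 bits
# Four equally likely boxes need 2 bits, matching I(x) = -log2(1/4):
print(entropy([0.25] * 4))   # 2.0 bits
```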
What is the continuous entropy formula?
H(X) = −∫ f(x) log₂(f(x)) dx
What does a 50:50 split of information represent?
Average entropy of 1 (more random outcome)
What is information gain?
Difference between initial entropy and weighted average entropy of subsets.
What is the formula for information gain?
IG(dataset, subsets) = H(dataset) − ∑_{S ∈ subsets} (|S| / |dataset|) · H(S)
What is the binary tree information gain formula?
IG(dataset) = H(dataset) − (|S_left| / |dataset| · H(S_left) + |S_right| / |dataset| · H(S_right))
What are ordered values in decision trees?
Attribute and split point (e.g., weight < 60)
What are categorical values in decision trees?
Search for the most informative feature, create branches for each value.
What is the first step in using ID3 algorithm?
Find the entropy of the initial dataset.
What is the entropy of the dataset D with 9 positive and 5 negative outcomes?
H(D) = 0.940
What is the entropy for 'sunny' outcomes?
H(D_sunny) = 0.971
What is the entropy for 'overcast' outcomes?
H(D_overcast) = 0
What is the entropy for 'rain' outcomes?
H(D_rain) = 0.971
What is the formula for information gain for 'outlook'?
IG(D, outlook) = H(D) − (5/14 · H(D_sunny) + 4/14 · H(D_overcast) + 5/14 · H(D_rain))
What is the total number of days in the dataset?
14 days
What is the information gain for 'outlook'?
0.246
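The worked ID3 example above can be reproduced numerically. The 2+/3− (sunny) and 3+/2− (rain) counts are inferred from the quoted entropies of 0.971 rather than stated in the cards:

```python
import math

def entropy(pos, neg):
    """Binary entropy of a subset with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

# Dataset D: 9 positive / 5 negative days in total.
h_dataset = entropy(9, 5)                    # ≈ 0.940
# Subsets after splitting on 'outlook':
h_sunny    = entropy(2, 3)                   # ≈ 0.971
h_overcast = entropy(4, 0)                   # 0 (all positive)
h_rain     = entropy(3, 2)                   # ≈ 0.971
ig_outlook = h_dataset - (5/14 * h_sunny + 4/14 * h_overcast + 5/14 * h_rain)
print(round(ig_outlook, 3))   # 0.247 (reported as 0.246 when the rounded entropies are used)
```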
What happens to the 'overcast' subset?
It is labeled as a tick since all outcomes are positive (1).
What is a common issue with decision trees?
They can overfit the data.
What is one method to deal with overfitting in decision trees?
Stopping early or pruning the tree.
What is the validation set size in cross-validation?
20% of the provided data.
What is the first step in pruning a decision tree?
Go through each internal node connected only to leaf nodes.
What does a random forest consist of?
A collection of decision trees trained on different subsets of data.
What is the outcome of the algorithm in a random forest?
The majority vote by all the different trees.
What do regression trees predict?
A real-valued number instead of a class label.
What is used instead of information gain for regression trees?
Variance reduction.
How do you make predictions with regression trees?
By taking an average or weighted average of samples in the leaves.
What is the purpose of taking an average in machine learning predictions?
To make predictions based on the distance of different samples in the leaves of the tree.
What is the ultimate goal when creating machine learning systems?
To develop models that generalise to previously unseen examples.
What is a held-out test dataset used for?
To measure the performance of a model on unknown data.
Why is shuffling important before splitting a dataset?
To avoid implicit ordering in the dataset that can bias results.
What are hyperparameters in machine learning?
Model parameters chosen before training, such as 'k' in k-NN.
What is the motivation behind hyperparameter tuning?
To choose hyperparameter values that give the best performance.
What is a disadvantage of testing hyperparameters on the training dataset?
It usually does not generalise well to unseen examples.
What should never be done when evaluating hyperparameters?
Using the test dataset to select hyperparameters based on accuracy.
What is the correct approach for dataset splitting in machine learning?
Split into training, validation, and test sets, e.g., 60:20:20.
What is the purpose of the validation set?
To select the best hyperparameters based on accuracy.
What is hyperparameter tuning/optimisation?
Selecting parameters that produce the best classifier performance.
What can be done for final evaluation after hyperparameter tuning?
Optionally include the validation set back into the training set.
What can be included in the training set for final evaluation?
Validation set can be included to retrain the model on the whole dataset after finding best hyperparameters.
What is the purpose of including the validation set in training?
It provides more data for training, potentially increasing model performance.
When is the final evaluation done?
The final evaluation is done on the test dataset.
What is a risk of developing and evaluating a model on the same data?
It results in overfitting the model to the training data.
What should the test set be used for?
The test set should only be used for estimating performance on unknown examples.
What is cross-validation used for?
Cross-validation is used when the dataset is small to ensure effective testing.
What are the steps in cross-validation?
Split the data into k folds; train on k−1 folds and test on the held-out fold; repeat k times so each fold is held out once; average the k scores.
What does the global error estimate formula represent?
It averages performance metrics across all k held-out test sets.
What is important about cross-validation in model evaluation?
It evaluates an algorithm rather than a single trained instance of a model.
What is one option for parameter tuning during cross-validation?
Use 1 fold for testing, 1 for validation, and k-2 for training in each iteration.
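The fold mechanics these options build on — each fold held out exactly once, scores averaged — can be sketched as plain k-fold cross-validation. The majority-class scorer is a made-up placeholder for a real train-and-evaluate routine:

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle indices and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def cross_validate(X, y, k, train_and_score):
    """Each fold is held out once; the global error estimate is the
    average score over the k held-out folds."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return np.mean(scores)

# Placeholder scorer: accuracy of always predicting the training majority class.
def majority_baseline(X_tr, y_tr, X_te, y_te):
    majority = np.bincount(y_tr).argmax()
    return np.mean(y_te == majority)

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)
est = cross_validate(X, y, k=5, train_and_score=majority_baseline)
```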
What is an alternative method for parameter tuning in cross-validation?
Cross-validation within cross-validation, separating 1 fold for testing.
How does the second option for parameter tuning help?
It allows for optimal hyperparameters to be found using more data.
What is the advantage of using different hyperparameters on each fold during cross-validation?
It likely leads to the best results for small data sets.
What is a disadvantage of using different hyperparameters on each fold?
It requires more work and experiments than simpler methods and is not practical in all situations due to high computation needs.
What is the advantage of testing on all data when going into production?
You can use all available data to train the model for better performance.
What is a disadvantage of testing on all data?
You cannot estimate the performance of the final trained model anymore; you rely on hyperparameters generalizing.
What are the steps in CASE 1 for plenty of data available?
Split into training, validation, and test sets; tune hyperparameters on the validation set; evaluate once on the test set.
What are the steps in CASE 2 for limited data available?
Use cross-validation, tuning hyperparameters on a validation fold within each iteration.
What does a confusion matrix represent?
It visualizes performance, showing true labels vs. predicted labels, allowing analysis of model performance.
What is accuracy in model evaluation?
Accuracy = (TP + TN) / (TP + TN + FP + FN).
How is classification error calculated?
Classification error = 1 - accuracy.
What is precision in model evaluation?
Precision = TP / (TP + FP). It measures the correctness of positive predictions.
What does high precision indicate?
If a model predicts something as positive, it is likely to be correct.
What is recall in model evaluation?
Recall = TP / (TP + FN). It measures the ability to find all positive examples.
What is the precision for Class 1?
60%
What is the formula for recall?
Recall = \( \frac{TP}{TP + FN} \)
What is the recall for Class 1?
75%
What does high recall indicate?
Good at retrieving positive examples, but may include false positives.
What is the trade-off between precision and recall?
High precision often leads to low recall and vice versa.
What is macro-averaged recall for two classes?
62.5%
What does the F-measure combine?
It combines precision and recall into a single score.
What is the formula for F1 score?
\( F1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \)
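Precision, recall, and F1 can be computed directly from confusion counts. The counts below are illustrative, chosen to reproduce the Class 1 figures quoted above (60% precision, 75% recall):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and
    false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts consistent with the Class 1 figures: TP=3, FP=2, FN=1.
p, r, f1 = precision_recall_f1(tp=3, fp=2, fn=1)
print(p, r)          # 0.6 0.75
print(round(f1, 3))  # 0.667 — the harmonic mean sits between the two
```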
What does a confusion matrix evaluate?
It evaluates performance in multi-class classification.
What is accuracy in classification?
Accuracy = \( \frac{Number \ of \ correctly \ classified \ examples}{Total \ number \ of \ examples} \)
What is the difference between micro-averaging and macro-averaging?
Macro-averaging averages metrics at the class level; micro-averaging at the item level.
What is the effect of micro-averaging on precision, recall, and F1 in binary and multi-class classification?
They equal accuracy.
What is micro-averaged precision, recall, and F1 equal to?
Accuracy
What is the most common evaluation metric for regression tasks?
Mean Squared Error (MSE)
How is MSE calculated?
MSE = \( \frac{1}{N} \sum_{i=1}^{N} (Y_i - \tilde{Y}_i)^2 \)
What does a lower MSE indicate?
Better predictions
What does RMSE stand for?
Root Mean Squared Error
How is RMSE calculated?
RMSE = \( \sqrt{MSE} \)
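Both metrics are a few lines of code; the target and predicted values below are made up:

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: average squared difference between
    correct and predicted outputs."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE, so the error is
    expressed in the same units as the target variable."""
    return math.sqrt(mse(y_true, y_pred))

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mse(y_true, y_pred))   # (1 + 0 + 4) / 3 ≈ 1.667
print(rmse(y_true, y_pred))  # ≈ 1.291
```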
What are the five important model characteristics in ML?
Accurate, Fast, Scalable, Simple, Interpretable
What is a balanced dataset?
Equal number of examples in each class
What is an imbalanced dataset?
Classes are not equally represented
What can affect accuracy in imbalanced datasets?
Performance of the majority class
What does macro-averaged recall help detect?
If one class is completely misclassified
What is a solution for imbalanced test sets?
Normalize counts in the confusion matrix
What does a normalized confusion matrix achieve?
Calculates metrics as if evaluated on a balanced dataset
What is one view of system performance on a balanced test set?
The classifier's performance remains the same.
What should be evaluated for a more realistic scenario?
The system should be evaluated with data having a realistic distribution.
What is one solution to balance classes?
Down-sample the majority class.
What is another solution to balance classes?
Up-sample the minority class.
What does overfitting indicate about model performance?
Good performance on training data, but poor generalization to other data.
What does underfitting indicate about model performance?
Poor performance on both training and test data.
What happens to classification error as models learn?
Classification error decreases for training but may increase for test data.
What can cause overfitting?
A model that is too complex or training data that is not representative.
How can we fight overfitting?
Choose optimal hyperparameters and use regularization.
What is a confidence interval?
A way to quantify confidence in an evaluation result.
What affects confidence in an evaluation result?
The size of the test set.
What is the impact of a small test set on accuracy?
90% accuracy measured on only 10 samples is far less trustworthy than the same accuracy measured on a large test set.
What affects confidence in evaluation results?
The size of the test set affects confidence in evaluation results.
What is true error?
True error is the probability that the model misclassifies a randomly drawn example from a distribution.
How is true error mathematically defined?
True error is defined as: error_D(h) ≡ Pr_{x∼D}[f(x) ≠ h(x)].
What is sample error?
Sample error is the classification error based on a sample from the underlying distribution.
How is sample error mathematically defined?
Sample error is defined as: error_S(h) ≡ (1/N) ∑_{x∈S} δ(f(x), h(x)).
What does 𝛿(𝑓(𝑥), ℎ(𝑥)) represent?
δ(f(x), h(x)) = 1 if f(x) ≠ h(x), and 0 if f(x) = h(x).
What is a confidence interval?
An N% confidence interval is an interval that is expected with probability N% to contain the parameter q.
What does a 95% confidence interval [0.2, 0.4] mean?
It means that with probability 95%, the true parameter q lies between 0.2 and 0.4.
How does sample size affect confidence intervals?
As sample size n increases, the interval narrows around the sample estimate.
What is the example confidence interval for errorS(h) = 0.22 with n = 50?
With n = 50 and z_N = 1.96, the 95% confidence interval is 0.22 ± 0.11 — a spread of over 20 percentage points.
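The interval quoted above follows from the standard normal-approximation formula error_S(h) ± z_N · sqrt(error · (1 − error) / n), sketched here:

```python
import math

def error_confidence_interval(error, n, z=1.96):
    """Approximate confidence interval for the true error, given a sample
    error measured on n examples (normal approximation; z=1.96 gives 95%)."""
    margin = z * math.sqrt(error * (1 - error) / n)
    return error - margin, error + margin

low, high = error_confidence_interval(0.22, n=50)
print(round(low, 2), round(high, 2))   # 0.11 0.33 — a very wide interval
low2, high2 = error_confidence_interval(0.22, n=5000)
# With 100x more test examples the interval shrinks by a factor of 10.
```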
What does statistical significance testing help determine?
Statistical significance testing helps determine if there is a difference between two distributions of classification errors.
What does a graph with overlapping distributions indicate?
Overlapping distributions indicate uncertainty about which classifier is better due to sampling error.
What is the Marek ApprovedTM test?
The Marek ApprovedTM test is the Randomisation test, considered intuitive for comparing algorithms.
What do statistical tests determine?
Statistical tests tell us if the means of two sets are significantly different.
Name three statistical tests mentioned.
Randomisation, T-test, Wilcoxon rank-sum.
How does the Randomisation test work?
It randomly switches predictions between two systems and measures if the performance difference is greater or equal to the original difference.
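A minimal version of the randomisation test might look like this. The per-example 0/1 error outcomes are made up, and `n_rounds` approximates the full permutation distribution by random sampling:

```python
import random

def randomisation_test(errors_a, errors_b, n_rounds=10000, seed=0):
    """Approximate randomisation (permutation) test: randomly swap the two
    systems' per-example outcomes and count how often the shuffled
    difference in mean error is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(errors_a) / len(errors_a) - sum(errors_b) / len(errors_b))
    count = 0
    for _ in range(n_rounds):
        a, b = [], []
        for ea, eb in zip(errors_a, errors_b):
            if rng.random() < 0.5:
                ea, eb = eb, ea      # swap this example's outcomes
            a.append(ea)
            b.append(eb)
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed:
            count += 1
    return count / n_rounds          # estimated p-value

# 0 = correct, 1 = error on each test example (illustrative outcomes):
sys_a = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
sys_b = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
p_value = randomisation_test(sys_a, sys_b)
# A p-value below 0.05 would let us reject the null hypothesis that
# the two systems perform the same.
```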
What does a small p-value indicate?
A small p-value means we can be more confident that one system is different from the other.
What is the null hypothesis?
The null hypothesis states that the two algorithms/models perform the same and differences are due to sampling error.
What is the significance level for performance difference?
Performance difference is statistically significant if p < 0.05 (5%).
What is P-hacking?
P-hacking is the misuse of data analysis to find patterns that appear statistically significant without an underlying effect.
What happens if the number of experiments increases in P-hacking?
Increasing experiments can lead to a higher false discovery proportion, even if true discoveries remain the same.
What is the false positive rate in the example of P-hacking?
P(false positive) = 0.05, the same as the significance level.
What is the false discovery proportion in the initial example?
The false discovery proportion is 35 / 115 = 30%.
What happens to the false discovery proportion when experiments increase to 2400?
The false discovery proportion increases to 115 / 195 = 59%.
How many true discoveries were made?
80 true discoveries
How many false discoveries were made?
115 false discoveries
What is the false discovery proportion?
59%
What is the sample size of the 'study'?
54 people
How many possible relations were searched in the 'study'?
27,716 possible relations
What is a method to defend against unintentional p-hacking?
Adaptive threshold for calculating p-value (Benjamini & Hochberg, 1995)
What is the first step in the Benjamini-Hochberg method?
Rank the p-values from the M experiments
What does the Benjamini-Hochberg critical value formula represent?
The new significance threshold for the i-th ranked p-value: (i / M) · α.
What is the original significance threshold in the Benjamini-Hochberg method?
5%
What is the downside of the Benjamini-Hochberg method?
Thresholds for most experiments will be lower than the original 5%
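The Benjamini-Hochberg procedure described above can be sketched as follows; the p-values are illustrative:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg: rank the M p-values, compare the i-th smallest
    against the critical value (i / M) * alpha, and accept every hypothesis
    up to the largest rank that passes its threshold."""
    m = len(p_values)
    indexed = sorted(enumerate(p_values), key=lambda t: t[1])
    cutoff_rank = 0
    for rank, (_, p) in enumerate(indexed, start=1):
        if p <= (rank / m) * alpha:
            cutoff_rank = rank
    significant = {idx for idx, _ in indexed[:cutoff_rank]}
    return [i in significant for i in range(m)]

# A naive 0.05 threshold would accept five of these; BH accepts only two.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p_values))
# [True, True, False, False, False, False, False, False]
```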
What are Artificial Neural Networks (ANNs)?
A class of ML algorithms optimized with gradient descent
What does Deep Learning refer to?
Using neural network models with multiple hidden layers
Why has deep learning become more popular now?
Better conditions for implementation, like big data and faster hardware
What are perceptrons?
An early version of neural networks proposed in 1958 by Rosenblatt
What is backpropagation?
Described in 1974 by Werbos, it is a training algorithm for neural networks
What are LSTMs and CNNs?
Key components of modern neural network architectures described in the late '90s
What is a benefit of having large datasets for neural networks?
They improve training efficiency and effectiveness
What advancements have improved neural network training?
Better CPUs and GPUs for efficient computation
What operations can be efficiently parallelized on graphics cards?
Matrix operations
What has improved the accessibility and affordability of graphics cards?
Increased efficiency and reduced cost
What are automatic differentiation libraries used for?
They handle backpropagation and optimisation of model parameters.
What is linear regression useful for in machine learning?
It serves as a stepping stone towards neural network models
What type of learning is linear regression?
Supervised learning
What does the dataset in supervised learning consist of?
Input and output pairs
What is the goal of supervised learning?
Learn the mapping f: X → Y
What does the function f represent in linear regression?
The mapping from inputs to outputs
What are the desired labels in classification problems?
Discrete labels
What are the desired labels in regression problems?
Continuous labels
What controls the gradient of a straight line in linear regression?
The parameter 'a'
What does the parameter 'b' represent in linear regression?
The y-intercept
What does the loss function measure in linear regression?
How well we are performing on our dataset
What is the formula for the loss function in linear regression?
E = (1/2N) · Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²
What does a smaller value of E indicate?
Predictions are close to real values
What do derivatives show in the context of linear regression?
How to change each parameter value to reduce loss
What is the purpose of gradient descent?
To repeatedly update parameters a and b
What does the learning rate (α) control in gradient descent?
The step size for updating parameters
What is the learning rate in gradient descent?
The learning rate, denoted as 𝛼, is a hyperparameter that determines the size of the steps taken towards the minimum of the loss function.
What does 𝜕𝐸/𝜕𝑎 represent?
It represents the partial derivative of the loss function with respect to parameter 𝑎.
What is the formula for updating parameter 𝑎?
The update rule is: a_new := a_old − (α/N) ∑ᵢ (a·x⁽ⁱ⁾ + b − y⁽ⁱ⁾)·x⁽ⁱ⁾, where N is the total number of data points.
What does an epoch represent in machine learning?
An epoch is one complete pass over the entire dataset during training.
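The full gradient-descent loop for simple linear regression, using a 1/(2N)-scaled squared-error loss and per-epoch parameter updates, might be sketched as follows; the learning rate, epoch count, and data are illustrative:

```python
def gradient_descent(xs, ys, alpha=0.05, epochs=1000):
    """Fit y ≈ a*x + b by gradient descent on
    E = (1/2N) * sum((a*x_i + b - y_i)^2)."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):            # one epoch = one pass over the data
        residuals = [a * x + b - y for x, y in zip(xs, ys)]
        grad_a = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        a -= alpha * grad_a            # step against the gradient
        b -= alpha * grad_b
    return a, b

# Noise-free data generated by y = 2x + 1:
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = gradient_descent(xs, ys)
print(round(a, 2), round(b, 2))   # 2.0 1.0
```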
What is the gradient in vector notation?
The gradient is a vector of all partial derivatives for a function with K parameters: ∇_θ f(θ) = [∂f(θ)/∂θ₁, ∂f(θ)/∂θ₂, …, ∂f(θ)/∂θ_K].
What is the analytical solution for linear regression?
The analytical solution allows finding optimal parameters without iterating through epochs by solving a specific equation.
What is the complexity of matrix inversion?
Matrix inversion has cubic complexity, making it computationally expensive for large problems.
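The analytical solution can be sketched with NumPy on data generated by y = 2x + 1. `np.linalg.inv` is used here to mirror the formula, even though inversion is cubic and `np.linalg.solve` would be preferred in practice:

```python
import numpy as np

# Data generated by y = 2x + 1 (no noise):
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Append a column of ones so the bias b is learned as an extra weight.
X_aug = np.hstack([X, np.ones((len(X), 1))])

# Normal equation: theta = (X^T X)^{-1} X^T y.
theta = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y
print(theta)   # ≈ [2.0, 1.0]  (slope a, intercept b)
```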
What is multiple linear regression?
Multiple linear regression uses multiple input features, each with its own parameter, to predict an output value.
How does the RMSE change with multiple features?
The RMSE (Root Mean Square Error) is typically lower with multiple features due to increased information for prediction.
What is RMSE in model evaluation?
Root Mean Square Error (RMSE) measures the differences between predicted and observed values; lower RMSE indicates better model accuracy.
How does using more features affect model predictions?
Using more features provides more information, leading to more accurate predictions in the model.
What does a linear regression model represent in higher dimensions?
In higher dimensions, the linear regression model is a continuous linear plane representing the learned data.
What is the role of the nucleus in a biological neuron?
The nucleus acts like the neuron's brain, telling it what to do.
What do dendrites do in a biological neuron?
Dendrites connect to other neurons and receive signals from them.
What happens when a biological neuron's axon fires?
When conditions are right, the axon fires a signal to connect with other neurons' dendrites.
What are input features in an artificial neuron?
Input features (xi) are the values fed into the artificial neuron, each with an associated weight (θi).
What determines the importance of a feature in an artificial neuron?
The weight (θi) associated with each input feature determines its importance in the artificial neuron.
What does the output of an artificial neuron involve?
The output involves multiplying features and weights, and adding the bias (b).
What is the activation function in an artificial neuron?
The activation function (g) transforms the output of the linear equation into a new value.
How can the bias term be included in the equation?
The bias term can be included by reformulating the equation to add an extra feature and weight for the bias.
What is the vector notation for input features and weights?
Input features and weights can be represented as vectors: x = [x1, x2, ..., xK], W = [θ1, θ2, ..., θK].
What is the logistic activation function used for?
The logistic function (sigmoid) squashes any value into a range between 0 and 1.
What does logistic regression actually do?
Logistic regression performs binary classification using the logistic function, not actual regression.
How is the logistic regression model optimized?
The logistic regression model is optimized using gradient descent.
What is a perceptron?
A perceptron is an algorithm for supervised binary classification, an early version of an artificial neuron.
What activation function does a perceptron use?
A perceptron uses a threshold function as its activation function, outputting 0 until a certain limit is reached.
How does the perceptron's threshold activation function behave?
A threshold function that outputs 0 until a limit (θ) is reached, then outputs 1.
What is the output of the activation function in the perceptron?
1 if Wᵀx > 0, otherwise 0.
What is the perceptron learning rule update formula?
θ_i ← θ_i + α(y − h(x)) x_i
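The update rule above can be sketched in a few lines of numpy; this is an illustrative implementation (function names and the OR example are my own, not code from the course):

```python
import numpy as np

def perceptron_train(X, y, alpha=0.1, epochs=20):
    """Perceptron rule: theta_i <- theta_i + alpha * (y - h(x)) * x_i.
    A constant feature 1 is appended so the bias is learned as a weight."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    theta = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            h = 1 if theta @ xi > 0 else 0   # threshold activation
            theta += alpha * (yi - h) * xi   # no update when h(x) == y
    return theta

def perceptron_predict(X, theta):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ theta > 0).astype(int)

# Logical OR is linearly separable, so the perceptron can learn it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_or = np.array([0, 1, 1, 1])
theta = perceptron_train(X, y_or)
```

Running the same loop on XOR labels would never converge, since no single hyperplane separates the classes.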
What happens when y = 1 and h(x) = 0?
Weight θi is increased if xi is positive, decreased if negative.
What happens when y = 0 and h(x) = 1?
We want to reduce Wᵀx, so the update is reversed: θ_i is decreased if x_i is positive, increased if negative.
What types of functions can a perceptron learn?
Any linearly separable function, like logical OR.
Why can't a perceptron learn XOR?
XOR is not linearly separable; no single straight line (hyperplane) can separate the classes.
What is a weakness of using a single neuron?
It cannot classify complex relationships like XOR.
What is needed to model complex relationships in data?
Multi-layer neural networks are required.
What is a multi-layer perceptron (MLP)?
A network that connects neurons in sequence to learn higher order features.
What is the role of hidden layers in a neural network?
They process features and are not visible from the outside.
What does each block in a block diagram represent?
A layer of the model with multiple neurons.
What is the first and last layer of a neural network called?
The first layer is the input layer and the last is the output layer.
What should you check when something isn’t working in a neural network?
Ensure that the matrix dimensions match.
What is b in the context of a neural network layer?
The layer-specific bias vector, unique to each neuron in a layer.
How many neurons are typically in deep neural networks?
Thousands or millions of neurons.
What can multi-layer neural networks learn?
Useful representations and features.
What was the approach to feature crafting before multi-layer networks?
Manually crafting features for pattern recognition.
What is end-to-end learning?
Allowing the network to learn features from raw input.
What do lower levels of a neural network act as?
Feature extractors.
What do higher levels of a neural network learn?
The higher levels act as the classification layer.
What is the benefit of training both feature extraction and classification layers together?
They optimize each other based on data.
What should you use if the data is linearly separable?
A linear function for the model.
What happens if we only use linear activation functions in a multi-layer network?
It becomes equivalent to a single-layer network.
What is the simplest activation function?
Linear activation (identity function).
What does the output of a neuron with linear activation become?
ŷ = f(Wᵀx) = Wᵀx.
What is the equation for output in a two-layer network?
ŷ = W1(W2x) = Ux, where U = W1W2.
What happens when a two-layer network uses linear activation?
It collapses into a single-layer network, unable to capture complex non-linear patterns.
What do non-linear activation functions do?
They allow models to learn complicated patterns by breaking the dependency of multiple layers collapsing into one.
What is the range of the sigmoid activation function?
The sigmoid function compresses output into the range between 0 and 1.
What is the formula for the sigmoid activation function?
f(x) = σ(x) = 1 / (1 + e^(-x))
What is the range of the tanh activation function?
The tanh function maps input values to the range -1 to 1.
What is the formula for the tanh activation function?
f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
What characterizes the ReLU activation function?
ReLU is linear and unbounded in the positive part, but non-linear overall.
What is the formula for the ReLU activation function?
f(x) = ReLU(x) = { 0 for x ≤ 0; x for x > 0 }
What does the softmax activation function do?
It scales inputs into a probability distribution that sums to 1.
What is the formula for the softmax activation function?
softmax(zi) = e^(zi) / ∑ e^(zk)
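The four activation functions above can be written directly in numpy; a small sketch (the max-shift inside softmax is a standard numerical-stability addition, not something from the cards):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes into (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # 0 for x <= 0, x for x > 0

def softmax(z):
    e = np.exp(z - np.max(z))         # shift by max for numerical stability
    return e / e.sum()                # non-negative, sums to 1

p = softmax(np.array([1.0, 2.0, 3.0]))
```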
What is a common activation function for deep neural networks?
ReLU is commonly used in very deep neural networks, especially for image recognition.
Which activation functions are more robust than ReLU?
Tanh and sigmoid are more robust than ReLU.
What is a potential issue with using ReLU?
ReLU can produce unbounded values, which can destabilise training.
What should you try first when designing models?
Experiment with tanh and sigmoid first, as they are bounded.
How should the choice of activation function in hidden layers be treated?
It is a hyperparameter that can be set empirically or optimized using a development set.
How can we set hyperparameters for activation functions?
Empirically or using a development set to find the best performing function for the model and dataset.
What determines the choice of activation function in the output layer?
It depends on the task.
What activation function is commonly used for binary classification?
Sigmoid is most common; tanh can also be used.
What activation function should be used for predicting unbounded scores?
Use a linear activation function.
What activation function is most commonly used for predicting a probability distribution?
Softmax is used for multi-class classification.
What does Softmax do?
It scales values into a probability distribution, making them sum to 1.
What is the input dimension for the neural network in PyTorch?
The input dimension is 10.
How many neurons are in the hidden layer of the PyTorch neural network?
There are 5 neurons in the hidden layer.
What is the output dimension of the PyTorch neural network?
The output dimension is 1.
What activation function is applied in the hidden layer during the forward pass?
Tanh is used as the activation function.
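The cards describe a PyTorch network with input dimension 10, a hidden layer of 5 neurons, output dimension 1, and tanh in the hidden layer. The original code is not reproduced here, so this is a numpy sketch of the same forward pass (the weight initialisation is my own assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the cards: input 10, hidden 5, output 1
W1, b1 = rng.normal(0, 0.1, size=(10, 5)), np.zeros(5)
W2, b2 = rng.normal(0, 0.1, size=(5, 1)), np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)   # hidden layer: tanh activation
    return h @ W2 + b2         # output layer: linear

x = rng.normal(size=(3, 10))   # a batch of 3 input vectors
y_hat = forward(x)             # shape (3, 1): one output per input
```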
What is the purpose of the loss function in neural networks?
To minimize and show performance on a specific task.
How do we update parameters in neural networks?
Using gradient descent to minimize the loss function.
What is the formula for updating parameters in gradient descent?
\( \theta_i^{(t+1)} = \theta_i^{(t)} - \alpha \frac{\partial E}{\partial \theta_i^{(t)}} \)
What type of task is a regression task?
Predicting a continuous variable, like velocity or price.
What is the goal of a regression task?
To predict a continuous variable.
What is an example of a regression task?
Predicting the price of a house.
What activation function is often used in the output layer for regression?
Linear activation.
What loss function is commonly used in regression?
Mean Squared Error (MSE).
What is the formula for Mean Squared Error (MSE)?
MSE = \frac{1}{N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2
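The MSE formula above as a one-line numpy sketch:

```python
import numpy as np

def mse(y_hat, y):
    # MSE = (1/N) * sum((y_hat_i - y_i)^2)
    return np.mean((y_hat - y) ** 2)

y = np.array([3.0, 5.0, 2.0])
perfect = mse(y, y)   # 0 when every prediction is correct
```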
What does MSE equal when predictions are correct?
0.
What is the primary goal of classification tasks?
To choose between different categories or discrete options.
What is binary classification?
Classification with only 2 possible classes.
What is multi-class classification?
Classification with more than 2 classes, where each input belongs to exactly 1 class.
What is multi-label classification?
Each input can belong to multiple classes.
What is the loss function used in classification?
Cross-entropy.
What do we want to maximize in classification?
The likelihood of the network assigning correct labels.
What is the probability formulation for binary classification?
\prod_{i=1}^{N} (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{(1-y^{(i)})}
What happens if the network assigns the correct label for every data point?
The product approaches 1.
What is the issue with multiplying probabilities in classification?
It can lead to underflow errors.
How can we avoid underflow errors in classification?
By maximizing the logarithm of the probability formula.
What is the formula for binary cross-entropy loss?
-\sum_{i=1}^{N} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]
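Binary cross-entropy as a numpy sketch; the eps-clipping is my own addition to avoid log(0), and the mean normalises by N so the loss magnitude is independent of the number of data points:

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    # clip predictions away from 0 and 1 so the logs stay finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```

A maximally uncertain prediction of 0.5 for a positive example costs log 2 ≈ 0.693.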
What is the formula for the binary cross-entropy loss normalised by N?
\( L = -\frac{1}{N} \sum_{i=1}^{N} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})] \)
What does normalizing by the number of data points do in loss calculation?
It makes the loss magnitude independent of the number of data points.
What is categorical cross-entropy?
It generalizes binary cross-entropy for multiple classes.
What is the formula for categorical cross-entropy loss?
\( L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_c(i) \log(\hat{y}_c(i)) \)
In categorical cross-entropy, what does y_c represent?
1 if C is the correct class for data point i, 0 otherwise.
What is the output layer configuration for a multi-class classification neural network example?
An output layer with 3 neurons predicting probabilities over 3 flower types.
What activation function is commonly used with categorical cross-entropy loss?
Softmax activation.
What is batching in neural networks?
Combining vectors of several data points into a matrix for simultaneous processing.
Why is batching beneficial for training neural networks?
It increases speed and reduces noise, leveraging GPU efficiency.
What does batching allow GPUs to do more efficiently?
Perform matrix multiplications in parallel.
How does batching assist in regularization during optimization?
It combines updates from several data points easily.
What is the benefit of batching in neural networks?
Combines updates from several datapoints, making updates more stable and accurate.
What is the input matrix X in a neural network?
A batch of data points with dimensions n x k, where n is the number of data points and k is the number of features.
What does the first layer in a neural network apply?
A linear transformation using a weight matrix and adding a bias.
What is Z in the context of a neural network?
The output matrix after applying the weight matrix and bias in the first layer.
What do we get after applying the activation function to Z?
A, the output of the first hidden layer.
What is the purpose of calculating loss in a neural network?
To determine how well the model performs.
What method is used to update model parameters in neural networks?
Gradient descent is used to update weight matrices and biases.
What is backpropagation in neural networks?
A method to calculate necessary partial derivatives iteratively.
How does backpropagation simplify calculations?
It breaks down calculations into smaller steps, moving backwards through the network.
What is the chain rule used for in neural networks?
To calculate the derivative of a composite function.
What does the chain rule formula represent?
It shows how to break down derivatives into smaller parts for easier calculation.
How can we find the partial derivative of the loss with respect to W[1]?
By breaking it down through Z[1] and A[1] using their respective derivatives.
What are the two types of partial derivatives in backpropagation?
The output of an activation function w.r.t its input and the output of a linear transformation w.r.t its input.
What is the purpose of the partial derivative in backpropagation?
To update the weights of the linear transformation in the neural network.
What does the partial derivative of a matrix w.r.t another matrix represent?
A 4-D tensor containing the partial derivatives of every element in the first matrix w.r.t every element in the second.
What is the linear transformation notation used in backpropagation?
Z = XW, where Z is the output, X is input, and W is weights.
What do you need to calculate to update weights in a linear transformation?
The partial derivative of the loss w.r.t the weights and the bias vector.
What is the shape of the partial derivative of a scalar w.r.t a matrix?
It has the same shape as the original matrix itself.
What is the key component in the derivatives during backpropagation?
The partial derivative of the loss w.r.t the output of the linear transformation.
What does backpropagation iteratively calculate?
Partial derivatives, taking them from the top layers and passing them down.
What is the bias vector used for in backpropagation?
It is repeated for each neuron in the layer to add the same bias to each input vector.
What is necessary for lower levels to calculate their own partial derivatives?
The gradient of the loss w.r.t the input and the weight's partial derivative.
What rule is used to break down the calculations in backpropagation?
The chain rule.
What is the significance of the dimensions N, D, and M in backpropagation?
N is the number of data points in the batch, D the input dimension, and M the output dimension.
What happens during the forward pass in a neural network?
The operation takes X and W as inputs and produces output Z.
What does the partial derivative of the loss w.r.t one element depend on?
It depends on the weights it multiplies with and the loss of whatever uses this element.
How many output values does the particular element affect?
It affects exactly 3 output values: z1,1, z1,2, and z1,3.
What is the equation for the partial derivative of the element?
The equation uses the chain rule and involves the weight w1,1 and the partial derivative of z1,1 w.r.t x1,1.
What happens when you calculate the partial derivative w.r.t the full matrix X?
It can be expressed as a dot product of two matrices.
What do the two matrices in the dot product represent?
The first is the partial derivative of the loss w.r.t Z, and the second is the transposed weight matrix for the layer.
What is the importance of backpropagation for inputs X?
It is a simple way of calculating backpropagation for inputs in a given layer.
How do we calculate the partial derivative w.r.t the weights?
By breaking it down for one individual weight, considering its effect on the output.
What does one weight affect in the output?
One weight affects two values in the output for two data points in the batch.
What is the equation for the partial derivative of the loss w.r.t the weights?
It is a dot product of the partial derivative of the loss w.r.t Z and the transposed matrix of features XT.
What do we need to calculate for the bias vector?
The partial derivative of the loss w.r.t the bias vector.
What result do we get for the partial derivative of the loss w.r.t the bias?
It is equal to a transposed column vector of 1s times the partial derivative of the loss w.r.t z.
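The three derivatives above — w.r.t the inputs, the weights, and the bias — can be checked numerically. A sketch for Z = XW + b with the illustrative loss L = ½∑Z², chosen so that ∂L/∂Z = Z:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))  # batch of 4 inputs with 3 features
W = rng.normal(size=(3, 2))  # weights for a 3 -> 2 layer
b = rng.normal(size=(1, 2))  # bias, broadcast over the batch

def loss(X, W, b):
    Z = X @ W + b                    # forward pass
    return 0.5 * np.sum(Z ** 2)      # illustrative loss with dL/dZ = Z

Z = X @ W + b
dZ = Z                               # partial derivative of loss w.r.t Z
dX = dZ @ W.T                        # gradient passed down to the inputs
dW = X.T @ dZ                        # gradient for the weight update
db = np.ones((1, X.shape[0])) @ dZ   # row of 1s times dL/dZ

# Numerical check of one weight entry via central differences
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(X, Wp, b) - loss(X, Wm, b)) / (2 * eps)
```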
What is needed to perform full backpropagation through the neural network?
How to handle the activation functions.
How are activation functions generally applied?
They are applied element-wise.
What is the purpose of activation functions in a neural network?
Activation functions are applied element-wise to introduce non-linearity, allowing the network to learn complex patterns.
Do activation functions have parameters that need updating during training?
No, activation functions generally do not have parameters that need to be updated during training.
What is the derivative of an activation function denoted as?
The derivative of an activation function is denoted as g′(x).
What does the chain rule help with in back propagation?
The chain rule helps calculate the partial derivative of the loss with respect to the inputs of the activation function.
What is the derivative of the Linear activation function?
For Linear: g(z) = z, g′(z) = 1.
What is the formula for the Sigmoid activation function?
For Sigmoid: g(z) = 1/(1 + e^(-z)), g′(z) = g(z)(1 - g(z)).
What is the formula for the Tanh activation function?
For Tanh: g(z) = (e^z - e^(-z))/(e^z + e^(-z)), g′(z) = 1 - g(z)².
What is the ReLU activation function and its derivative?
For ReLU: g(z) = z for z > 0, 0 for z ≤ 0; g′(z) = 1 for z > 0, 0 for z ≤ 0.
How is Softmax different from other activation functions?
Softmax takes a whole vector as input and outputs a whole vector, unlike other activation functions applied element-wise.
What is the purpose of combining Softmax with cross-entropy?
Combining Softmax with cross-entropy simplifies the backpropagation of derivatives for classification tasks.
What does the joint partial derivative through Softmax and cross-entropy represent?
It represents the predictions minus the true class labels, normalized by N if applicable.
What is gradient descent?
Gradient descent is an optimization algorithm that updates parameters by taking small steps in the negative direction of the gradient.
What is the formula for updating weights in gradient descent?
W_new = W_old - α * (∂L/∂W), where α is the learning rate.
What is the learning rate in gradient descent?
The learning rate (α) is a hyperparameter that determines the step size for updating model parameters.
What does α represent in gradient descent?
Learning rate/step size, a hyperparameter based on the development set.
What must be true for gradients to be computed in neural networks?
Network functions and the loss need to be differentiable.
What is the first step in the general algorithm for gradient descent?
Initialise weights randomly.
What is the termination condition in gradient descent?
When the loss function does not improve anymore.
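The general algorithm — initialise, step in the negative gradient direction, stop when updates stop mattering — in a minimal sketch on a one-dimensional quadratic (the example loss is my own):

```python
def gradient_descent(grad, w0, alpha=0.1, tol=1e-8, max_steps=10000):
    w = w0                       # initialise (randomly, in general)
    for _ in range(max_steps):
        step = alpha * grad(w)   # small step along the negative gradient
        w = w - step
        if abs(step) < tol:      # terminate when the update is negligible
            break
    return w

# L(w) = (w - 3)^2 has dL/dw = 2(w - 3) and its minimum at w = 3
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```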
What is a common issue when updating weights during backpropagation?
Updating some weights before all gradients have been computed with the original weights produces incorrect gradients.
What is Stochastic Gradient Descent (SGD)?
Calculating the gradient based on one data point and updating weights immediately.
What are the steps in Stochastic Gradient Descent?
For each data point: compute the gradient of the loss on that single point, then update the weights immediately; repeat over the whole dataset.
What is Mini-batched Gradient Descent?
A balance between batch and stochastic gradient descent, using batches of data points.
What are the steps in Mini-batched Gradient Descent?
Split the data into small batches; for each batch, compute the gradient of the loss over the batch and update the weights.
What is a challenge in optimising neural networks?
Finding the lowest point on complex loss surfaces is difficult.
Why is the learning rate important?
The size of the learning rate significantly affects the training process.
What happens if the learning rate is too low?
Optimization can take a very long time to reach a good minimum.
What happens if the learning rate is too high?
We can step over the correct solution.
What is the ideal state of the learning rate?
It allows reaching the minimum of the loss function in a reasonable number of steps.
What is the learning rate?
A hyperparameter that needs to be chosen based on the development set.
What are adaptive learning rates?
Different learning rates for each parameter in the model.
What happens if a parameter has not been updated for a while?
The learning rate for that parameter may be increased.
What happens if a parameter is making big updates?
The learning rate for that parameter may be decreased.
What algorithms work well for adaptive learning rates?
The 'Adam' and 'AdaDelta' algorithms.
What is learning rate decay?
Scaling the learning rate by a value between 0 and 1.
What is the intuition behind learning rate decay?
Take smaller steps as we approach the minimum to avoid overshooting.
When can learning rate decay be performed?
Every epoch, after a certain number of epochs, or when validation performance doesn't improve.
What is the simplest approach to weight initialization?
Setting weights to zeros.
Why should we not set all weights to zero?
Neurons will learn the same things, leading to the same optimized values.
What is a common method for weight initialization?
Drawing randomly from a normal distribution with mean 0 and variance 1 or 0.1.
What does Xavier Glorot initialization do?
Draws values from a uniform distribution based on the number of neurons in layers.
What is the formula used in Xavier Glorot initialization?
Weights are drawn from a uniform distribution with bounds ±√6/√(n_in + n_out), where n_in and n_out are the numbers of neurons in the adjacent layers.
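A sketch of Glorot/Xavier uniform initialisation, assuming the standard bound √6/√(n_in + n_out):

```python
import numpy as np

def xavier_uniform(n_in, n_out, seed=0):
    # Glorot/Xavier: bound = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(100, 50)   # weights for a 100 -> 50 layer
```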
What role does randomness play in neural networks?
It is important for various aspects of the learning process.
What role does randomness play in neural networks?
Different random initialisations lead to different results and performance.
What is the solution to controlling randomness in neural networks?
Explicitly set the random seed for all random number generators used.
What can happen when processes are parallelised on GPUs?
They can produce randomly different results due to different threads running at different times.
How should you report model performance under different random seeds?
Report the mean and standard deviation of the performance.
What is min-max normalisation?
Scaling the smallest value to a and the largest to b, e.g., [0, 1] or [-1, 1].
What is the formula for min-max normalisation?
X′ = a + (X - Xmin)(b - a) / (Xmax - Xmin)
What is standardisation (z-normalisation)?
Scaling the data to have mean 0 and standard deviation 1.
What is the formula for standardisation?
X′ = (X - μ) / σ
Why is normalisation important in neural networks?
It helps weight updates to be proportional to the input, improving model learning accuracy.
What should you remember about normalisation for data columns?
Normalise each column separately, not the entire matrix.
How should normalising constants be calculated?
Calculate them based only on the training set and apply them to test/evaluation sets.
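A sketch of the recommended workflow: compute the standardisation constants per column on the training set only, then reuse them for the test set (the toy data is illustrative):

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 25.0]])

# Constants computed per column, on the TRAINING set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)

train_z = (train - mu) / sigma   # mean 0, std 1 per column
test_z = (test - mu) / sigma     # same constants reused for test data
```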
What is gradient checking?
A method to verify if the gradient is calculated correctly in the implementation.
What are the two methods to isolate the gradient?
Using the difference between weights before and after an update, or measuring the change in loss under small perturbations of a weight.
What is the formula for the gradient using weight difference?
∂L(w)/∂w = (w(t-1) - w(t)) / α
What is the formula for measuring change in loss?
∂L(w)/∂w ≈ (L(w + ε) - L(w - ε)) / (2ε)
What is the definition of a partial derivative?
The partial derivative of L(w) with respect to w is defined as: \( \frac{\partial L(w)}{\partial w} = \lim_{\epsilon \to 0} \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon} \)
What indicates a bug in neural network training?
If the values from different methods of calculating partial derivatives are not similar, it indicates a bug.
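A minimal gradient check using the central-difference formula above; L(w) = w² is an illustrative loss whose analytic gradient 2w is compared against the numerical estimate:

```python
def numerical_gradient(L, w, eps=1e-5):
    # central difference: (L(w + eps) - L(w - eps)) / (2 * eps)
    return (L(w + eps) - L(w - eps)) / (2 * eps)

# Illustrative loss L(w) = w^2, whose analytic gradient is 2w
analytic = 2 * 3.0
numeric = numerical_gradient(lambda w: w ** 2, 3.0)
# a large gap between the two values would indicate a bug
```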
What is overfitting in neural networks?
Overfitting occurs when a model learns the training data too well, failing to generalize to unseen data.
How can overfitting be prevented?
To prevent overfitting, use held-out validation and test sets to measure generalization performance.
What is network capacity?
Network capacity refers to the number of parameters in a model and its ability to overfit the dataset.
What does it mean if a model is underfitting?
Underfitting means the model performs poorly on both training and validation sets due to insufficient capacity.
How can you improve a model that is underfitting?
Increase the number of neurons, parameters, or layers in the model to improve learning.
What indicates a model is overfitting?
Overfitting is indicated by good performance on the training set but poor performance on the validation set.
What is one method to prevent overfitting?
Limit the number of parameters in the model to prevent memorization of the dataset.
What is the best solution to overfitting?
The best solution to overfitting is to acquire more data for training.
What is early stopping in neural network training?
Early stopping is a method where training is halted when performance on the validation set does not improve for a set number of epochs.
What is regularization in the context of neural networks?
Regularization adds constraints to the model to prevent overfitting, such as penalizing large weights.
What are L2 and L1 regularization?
L2 regularization adds squared weights to the loss function, while L1 regularization adds absolute weights, both helping to control model complexity.
What does L2 regularization do to weights?
L2 regularization penalizes larger weights more, encouraging sharing between features and pushing weights towards 0.
What does L2 regularisation do?
Adds squared weights to the loss function, penalising larger weights more and encouraging sharing between features.
What is the formula for L2 regularisation loss function?
The formula is: J(θ) = Loss(y, ŷ) + λ ∑ w²
How does L2 regularisation affect weight updates?
The update rule is: w ← w − α(∂Loss/∂w + 2λw)
What is the role of the hyperparameter λ in L2 regularisation?
Controls the importance of regularisation, usually set to a low value (e.g., 0.001).
What does L1 regularisation do?
Adds the absolute value of weights to the loss function, using the sign of the weight for updates.
What is the formula for L1 regularisation loss function?
The formula is: J(θ) = Loss(y, ŷ) + λ ∑ |w|
How does L1 regularisation affect weight updates?
The update rule is: w ← w − α(∂Loss/∂w + λ sign(w))
How do L1 and L2 regularisation differ in weight management?
L2 pushes all weights towards 0, while L1 encourages sparsity, keeping many weights at 0.
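The two penalties and their update rules as a numpy sketch (function names are mine):

```python
import numpy as np

def l2_penalty(w, lam=0.001):
    return lam * np.sum(w ** 2)     # J = Loss + lambda * sum(w^2)

def l1_penalty(w, lam=0.001):
    return lam * np.sum(np.abs(w))  # J = Loss + lambda * sum(|w|)

def regularised_update(w, grad, alpha, lam, kind="l2"):
    if kind == "l2":
        return w - alpha * (grad + 2 * lam * w)   # pushes w towards 0
    return w - alpha * (grad + lam * np.sign(w))  # encourages sparsity
```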
What is dropout in neural networks?
A method to reduce overfitting by randomly setting some neural activations to 0 during training.
What percentage of neurons are typically dropped during training with dropout?
About 50% of neurons are typically dropped at each backward pass.
What happens during testing when using dropout?
All neurons are used, but inputs are scaled to match training expectations.
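A sketch of dropout as described on these cards: activations are randomly zeroed during training, and scaled at test time so magnitudes match training expectations (many frameworks instead scale during training, so-called inverted dropout):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p_drop=0.5):
    # training: randomly zero out activations
    mask = rng.random(a.shape) >= p_drop
    return a * mask

def dropout_test(a, p_drop=0.5):
    # testing: keep all neurons but scale activations so the
    # expected magnitude matches what the next layer saw in training
    return a * (1 - p_drop)

a = np.ones(1000)
dropped = dropout_train(a)   # roughly half the activations become 0
```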
What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data, while unsupervised learning uses only feature values without labels.
What is the objective of unsupervised learning?
To find hidden structures in the dataset without ground-truth labels.
What is unsupervised learning?
A type of learning where the dataset consists only of feature values without ground-truth labels.
What is the objective of unsupervised learning?
To find hidden structures in the dataset for making inferences or decisions.
What is clustering in unsupervised learning?
The task of finding groups ('clusters') of samples that might belong to the same class.
What is density estimation?
Finding the probability of seeing a point in a certain location compared to another location.
What is dimensionality reduction?
A process to reduce the number of features while retaining important information.
Name a famous algorithm for dimensionality reduction.
Principal Component Analysis (PCA).
What does clustering imply about intra-cluster variance?
There is low intra-cluster variance among instances in the same cluster.
What is the k-means algorithm used for?
To identify a specified number of clusters in a dataset.
What are the steps of the k-means algorithm?
Initialisation, Assignment, Update, and checking for convergence.
What is a cluster in clustering?
A set of instances that are similar to each other and dissimilar to instances in other clusters.
How does clustering help in vector quantization?
It improves encoding by clustering information in a datastream to reduce data size.
What is an example of using clustering in nature?
Identifying different species of flowers by plotting features like petal length vs. sepal width.
What is the structure of an unsupervised learning task?
A feature space with datapoints lacking additional information like labels or values.
What does k represent in k-means clustering?
The number of clusters, e.g., k = 3 means there are 3 centroids.
What is the first step in the k-means algorithm?
Initialisation: Select k random instances or generate random vectors for centroids.
What is the goal of the assignment step in k-means?
Assign every point in the dataset to the nearest centroid.
How do we update centroids in k-means?
By computing the average position of all points in each cluster.
What is checked during the convergence step in k-means?
The displacement of centroids; if it's larger than a threshold, loop back to assignment.
What are Voronoi diagrams?
Diagrams that create decision boundaries equidistant between centroids.
What is the formula for the assignment step in k-means?
∀i ∈ {1, …, N}: c(i) = argmin_{k ∈ {1, …, K}} ‖x(i) − μ_k‖²
What does the update formula in k-means compute?
The average location for all samples assigned to cluster k.
What condition indicates convergence in k-means?
If ∀k: ‖μ_k^t − μ_k^{t−1}‖ < ε.
What is checked in Step 4 of K-means?
Convergence by computing the movement of centroids between timesteps.
What indicates to stop iterating in K-means?
If the movement of centroids is lower than a certain threshold (𝜖).
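The four steps — initialise, assign, update, check convergence — as a compact numpy sketch; the empty-cluster guard and the toy two-blob data are my own additions:

```python
import numpy as np

def kmeans(X, k, max_iters=100, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 - Initialisation: pick k random data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2 - Assignment: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3 - Update: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid - a guard I added)
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        # Step 4 - Convergence: stop when centroids barely move
        moved = np.linalg.norm(new - centroids)
        centroids = new
        if moved < eps:
            break
    return centroids, assign

# Two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
centroids, assign = kmeans(X, k=2)
```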
How is K-means viewed as a model?
As a model optimization problem with centroid locations and data point assignments.
What is the objective of K-means?
Minimize the loss function L for assignments of data points to centroids.
What does the loss function L represent?
The mean distance between samples and their associated centroid.
What is the significance of K in K-means?
K is a crucial hyperparameter that affects the clustering results.
What is the Elbow Method used for?
To determine the optimal value of K by plotting loss values against K.
What should be selected according to the Elbow Method?
The value of K where the rate of decrease in loss sharply shifts.
What does cross-validation help determine?
The best value for hyperparameters using a validation set.
What are the strengths of K-means?
Simple, popular, and efficient with linear complexity.
What is a significant weakness of K-means?
The need to define K, which significantly impacts results.
What is a significant hyperparameter in K-means?
K (the number of clusters)
What is a weakness of K-means regarding its results?
It only finds a local optimum and is sensitive to initial centroid positions.
What technique can improve K-means initialization?
K-means++
When is K-means applicable?
When a distance function exists on the dataset, typically with real values.
What algorithm works with categorical data in clustering?
The k-modes algorithm.
How does the K-medoids algorithm differ from K-means?
It is less sensitive to outliers by using the geometric median instead of the mean.
What shape must clusters have for K-means to work effectively?
Clusters must be hyper-ellipsoids (or hyper-spheres).
What is the objective of density estimation algorithms?
To estimate the probability density function p(x) from data.
What does a Probability Density Function (PDF) model?
The likelihood of a continuous variable being observed within an interval.
What must the integral of a PDF over its range equal?
1
What is one application of density estimation?
Anomaly/novelty detection.
What is the goal of generative models in relation to probability?
To model the distribution of a class as p(X | y).
What do discriminative models directly model?
The probability of observing label y given sample values X, p(y | X).
What activation function transforms neural network output into a probability distribution?
Softmax activation.
What does the Softmax activation do?
Transforms the output of the neural network into a probability distribution.
What is Bayes’ rule used for in generative models?
To turn the generative model into a discriminative classifier.
What is the formula for Bayes’ rule?
\( p(y | X) = \frac{p(X | y)p(y)}{p(X)} \)
What do non-parametric methods assume about function shape?
They make no assumptions about the form/shape of the function.
What is an example of a non-parametric method?
k-NN algorithm.
What is the bias and variance characteristic of non-parametric methods?
Low bias; high variance depending on the data.
What do histograms do in density estimation?
Group data into bins, count occurrences, and normalize.
What does normalization ensure in histograms?
The integral of the function sums to 1, making it a valid PDF.
What is Kernel Density Estimation?
Estimates the density of a function by using a kernel around training examples.
What does the kernel function do in density estimation?
It weights each training example by its difference from the current point x, normalised by the bandwidth.
What is a Parzen window?
A method used in kernel density estimation to define the kernel.
What type of distribution can be used as a kernel in density estimation?
Gaussian distribution.
What are the characteristics of parametric approaches?
Make assumptions about the shape, inducing bias but fixing the number of parameters.
What is the univariate Gaussian distribution parameterized by?
Mean (μ) and variance (σ²).
What is ensured by the normalization factor in Gaussian distribution?
The integral of the distribution sums to 1.
What does the multivariate Gaussian distribution take as input?
A multi-dimensional vector.
What replaces variance in the Multivariate Gaussian Distribution?
The covariance matrix Σ.
What is the purpose of the normalization term in the Multivariate Gaussian Distribution?
To ensure the integral over all dimensions equals 1.
What does likelihood determine in a model?
How good the model is at capturing the probability of generating data x.
What assumption is made about the datapoints in the training set?
They follow i.i.d distributions.
What do we multiply to get the likelihood in a dataset?
The predicted values from the models for every sample with parameters θ.
Why do we calculate negative log-likelihood instead of likelihood?
To turn maximization into minimization, similar to training a neural network.
What does Gaussian fitting minimize?
The negative log likelihood.
What happens when you take the log of a multiplication term?
Multiplications turn into sums.
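The cards above (i.i.d. likelihood, negative log-likelihood, log turning products into sums) can be illustrated with a univariate Gaussian fit. A minimal sketch, assuming 1-D data; the closed-form maximum-likelihood estimates are the sample mean and sample standard deviation.

```python
import math

def gaussian_nll(data, mu, sigma):
    """Negative log likelihood of 1-D data under N(mu, sigma^2).
    Taking the log turns the product over i.i.d. samples into a sum."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

def fit_gaussian(data):
    """Closed-form maximum-likelihood estimates: sample mean and std."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    return mu, math.sqrt(var)

data = [1.0, 2.0, 3.0, 4.0]
mu, sigma = fit_gaussian(data)
```

Any other choice of mean gives a strictly larger negative log likelihood on the same data.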
Is the Gaussian distribution sufficient for modeling densities in all cases?
No, it may not be satisfactory for all data distributions.
What is the problem with fitting a Gaussian distribution to bimodal data?
It induces bias and may not capture the data's characteristics.
What is a potential solution to the limitations of Gaussian distributions?
Using mixture models to capture different modes of the distribution.
How is the PDF of mixture models defined?
As the weighted sum of multiple PDFs: \( p(x) = \sum_k \pi_k p_k(x) \).
What constraints does the mixing proportion 𝜋𝑘 follow?
\( 0 \le \pi_k \le 1 \) and \( \sum_k \pi_k = 1 \).
What does the Gaussian Mixture Model (GMM) estimate?
The probability density with p(x) from multiple Gaussian distributions.
What is the Gaussian Mixture Model a weighted sum of?
Gaussians, ensuring the PDF integrates to 1.
What is the purpose of GMMs?
GMMs can model complicated data, including multi-modal data.
What algorithm is used to fit GMM to training examples?
The Expectation Maximisation (EM) algorithm is used.
What are the two main steps of the EM algorithm?
The two main steps are the E-step (expectation) and the M-step (maximisation).
What is done in the E-step of the EM algorithm?
Responsibilities for each training example and each mixture component are computed.
How is the responsibility calculated in the E-step?
Using the formula: \( r_{ik} = \frac{\pi_k \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x^{(i)} | \mu_j, \Sigma_j)} \)
What is updated in the M-step of the EM algorithm?
GMM parameters are updated using the computed responsibilities.
How is the mean updated in the M-step?
The mean is updated using: \( \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} r_{ik} x^{(i)} \)
What is checked for convergence in the EM algorithm?
Convergence is checked by monitoring changes in parameters or log likelihood.
What is the Bayesian Information Criterion (BIC)?
BIC is used to select the number of components K in GMM.
What is the formula for BIC?
\( BIC_k = \mathcal{L}(K) + \frac{P_k}{2} \log(N) \)
What does \( \mathcal{L}(K) \) represent in the BIC formula?
\( \mathcal{L}(K) \) is the negative log likelihood.
What is the penalty term in the BIC formula?
The penalty term is \( \frac{P_k}{2} \log(N) \), which penalizes complex models.
What does N represent in the BIC formula?
N is the number of examples in the dataset.
What does Pk represent?
Pk is the number of parameters.
How many parameters does a 2D GMM with K components have?
Pk = 6K - 1.
What are the parameters for the mean in 2D Gaussian?
2 parameters for the mean (2D vector).
How many parameters are needed for covariance in 2D Gaussian?
3 parameters for the covariance (symmetric 2x2 matrix).
What is the purpose of the -1 in the parameter count?
It accounts for the constraint that the sum of mixing proportions must equal 1.
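Putting the BIC cards together: for a 2-D GMM, each component contributes 2 mean + 3 covariance + 1 mixing parameters, minus one for the sum-to-one constraint. A small sketch using the course's convention (L(K) is the negative log likelihood, so the best K minimises BIC):

```python
import math

def bic_2d_gmm(neg_log_likelihood, k, n):
    """BIC_K = L(K) + (P_K / 2) * log(N) for a 2-D GMM with K components.
    P_K = 6K - 1: per component 2 (mean) + 3 (symmetric 2x2 covariance)
    + 1 (mixing proportion), minus 1 for the sum-to-one constraint."""
    p_k = 6 * k - 1
    return neg_log_likelihood + (p_k / 2) * math.log(n)
```

With the same fit quality, a model with more components always pays a larger penalty.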
What principle is suggested for model selection?
Occam’s Razor: pick the simplest model that fits.
What happens to BIC values as K increases?
BIC values decrease sharply then rise again due to penalty dominance.
What is cross-validation used for?
To find the most appropriate number of components K.
What are the steps in cross-validation for GMM-EM?
What is a key similarity between GMM and K-means?
Both require selecting the most appropriate K value for clusters/components.
What does convergence mean in GMM and K-means?
Convergence occurs when changes in parameters are sufficiently small.
How does GMM initialization relate to K-means?
GMM means are often initialized from K-means centroid locations.
What is soft clustering in GMM?
Every point belongs to several clusters with varying degrees of membership.
What distance metric is used in GMM?
Distance is related to Mahalanobis distance, encoded by the covariance matrix.
What is the focus of Module 7?
Evolutionary algorithms, including genetic algorithms.
What is the purpose of genetic/evolutionary algorithms?
Optimization for black box functions.
What is a Genetic/Evolutionary Algorithm?
An optimisation method for black box functions without knowing the mathematical equation or gradient, inspired by natural evolution and genetics.
What is reinforcement learning?
Learning to maximize a numerical reward, considered an optimization problem.
What do traditional RL algorithms deal with?
Discrete states and action spaces.
What do policy search algorithms deal with?
Continuous search spaces, represented as \( x^* = \arg\max_x f(x) \).
What are Black-Box Optimisation Algorithms?
Algorithms where the links between parameters are unknown at the start of training.
What is an example of black-box optimisation in robotics?
The unknown relationship between speed and joint movements.
What year did Darwin publish his theory about the origin of species?
1859.
What are the four main concepts of Darwin's theory?
Who discovered principles of statistical inheritance?
Mendel in 1866.
What did Weismann discover in 1883?
Acquired traits are not passed to offspring.
What did Watson, Crick, and Franklin discover in 1953?
The structure of DNA.
What is a gene?
A sequence of nucleotides in DNA that codes a particular trait.
What is a genotype?
A set of genes (parameters).
What is a phenotype?
The physiological expression of the genotype.
What are the three main families of genetic algorithms proposed in the 60s?
What is Genetic Programming?
The evolution of programs, defining and computing a program as a tree.
What is the main concept of genetic/evolutionary algorithms?
They have a population of solutions encoding genotypes, which are developed into phenotypes for evaluation.
What happens to the worst-performing functions in genetic algorithms?
They are removed (killed); crossover and mutation are applied to the better-performing solutions to create offspring.
What is observed in the black box function?
The output helps to rank the phenotypes.
What is the result of repeating the evolutionary process?
The solution converges to an optimal high-performing solution.
What principle do these algorithms use as a base?
A simplified version of Neo-Darwinism.
How is each solution represented in evolutionary algorithms?
Each solution is represented by a genotype.
What function measures the performance of phenotypes?
A fitness function is used.
What is the selection operator?
It selects the solutions that will be reproduced.
What does the cross-over operator do?
It mixes the parents’ genotype to create the offspring.
What is the mutation operator?
It applies variations to the genotype after reproduction.
What is the genotype in a genetic algorithm?
A binary string of fixed size (e.g., 01001010).
What is the genotype in genetic programming?
A program represented as a tree (often in LISP).
What does the mutation in evolutionary strategies draw from?
It draws from a Gaussian distribution.
What term is used to describe the blurred lines between algorithm families?
Evolutionary Algorithms.
What is the goal of the Mastermind game?
Finding the secret combination of colors.
How many colors can each piece have in Mastermind?
Each piece can have 6 different colors.
What is the fitness function for Mastermind?
F(x) = p1 + 0.5*p2.
What does p1 represent in the Mastermind fitness function?
The number of pieces with the right color and correct position.
What does p2 represent in the Mastermind fitness function?
The number of pieces with the right color but wrong position.
What is the goal of evolutionary algorithms regarding fitness functions?
Maximize the fitness function.
What value of F(x) indicates the problem is solved?
F(x) = 4 (all four pieces have the right colour and position).
What is the fitness function for teaching a robot to walk?
F(x) = walking speed = travelled distance after a few seconds.
What is the fitness function for teaching a robot to throw an object?
F(x) = distance(object, target).
What do genotype and phenotype represent in problem-solving?
Potential solutions to the problem.
What is the genotype for the Mastermind game?
Binary string with N*3 bits.
How is the phenotype created from the genotype in the Mastermind game?
Aggregate bits 3 by 3, each trio becomes an integer.
What do integers correspond to in the Mastermind game?
Different colours: (0=red, 1=yellow, 2=green, 3=blue…).
What is done with invalid genotypes in the Mastermind game?
Assigned the lowest fitness value to reduce survival chance.
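The genotype-to-phenotype decoding described above can be sketched directly. A minimal illustration; the colour names beyond the four listed in the notes are assumptions, and invalid genotypes (a trio decoding to 6 or 7) are flagged so the caller can assign them the lowest fitness.

```python
# 6 possible colours; names after "blue" are hypothetical placeholders
COLOURS = ['red', 'yellow', 'green', 'blue', 'orange', 'purple']

def decode_mastermind(genotype, n_pieces):
    """Decode an N*3-bit binary string: each group of 3 bits is one
    colour index. Returns None for invalid genotypes (index >= 6),
    which should then receive the lowest fitness value."""
    assert len(genotype) == n_pieces * 3
    phenotype = []
    for i in range(0, len(genotype), 3):
        idx = int(genotype[i:i + 3], 2)   # trio of bits -> integer 0..7
        if idx >= len(COLOURS):
            return None                   # 6 or 7: no matching colour
        phenotype.append(COLOURS[idx])
    return phenotype
```

For example, `'000001010011'` with 4 pieces decodes to red, yellow, green, blue.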
What is the purpose of selection operators in evolutionary algorithms?
Select parents for the next generation.
What is a standard approach for selection in evolutionary algorithms?
Biased roulette wheel.
How does the biased roulette wheel process work?
Individuals are selected based on their fitness proportion.
What is the first step in the biased roulette wheel process?
Compute the probability pi to select an individual.
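The biased roulette wheel can be sketched as follows: compute each individual's selection probability as its share of the total fitness, then sample accordingly. A minimal illustration assuming non-negative fitness values.

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Biased roulette wheel: individual i is selected with
    probability p_i = f_i / sum_j f_j (assumes non-negative fitness)."""
    total = sum(fitnesses)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for individual, f in zip(population, fitnesses):
        acc += f
        if r <= acc:
            return individual
    return population[-1]  # guard against floating-point round-off
```

Over many draws, fitter individuals are selected proportionally more often.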
What is the alternative to the roulette wheel selection method?
Tournament selection.
What is elitism in evolutionary algorithms?
Keeping a fraction of the best individuals in the new generation.
What fraction is usually fixed for elitism?
10%.
What is the role of the crossover operator?
Combine traits of the parents.
What is a common method for crossover?
Single-point crossover.
What is the role of the mutation operator?
Explore nearby solutions in the local solution space.
How is standard mutation on binary strings performed?
Randomly generate a number for each bit; if lower than probability m, mutate.
What is the first step in standard mutation on binary strings?
Randomly generate a number between 0 and 1 for each bit of the genotype.
What happens if the generated number is lower than probability m?
The bit is flipped.
What is m typically set to in standard mutation?
1/(size of the genotype).
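The two variation operators described above — single-point crossover and per-bit mutation with probability m = 1/(genotype length) — can be sketched on binary strings. An illustrative sketch, not code from the course.

```python
import random

def single_point_crossover(parent_a, parent_b, rng=random):
    """Cut both parent genotypes at one random point and swap the tails."""
    point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def bit_flip_mutation(genotype, m=None, rng=random):
    """Flip each bit independently with probability m
    (default m = 1 / genotype length, as in the notes)."""
    if m is None:
        m = 1.0 / len(genotype)
    return ''.join(
        ('1' if bit == '0' else '0') if rng.random() < m else bit
        for bit in genotype
    )
```

Crossover mixes the parents' traits; mutation explores nearby solutions by flipping the occasional bit.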
What is the purpose of the specific mutation in the Mastermind problem?
To swap groups of 3 bits in the genotype with probability m2.
What is a common stopping criterion for evolutionary algorithms?
When a specific fitness value is reached.
What fitness value indicates an optimal solution in the example?
A fitness value of 4.
What is another stopping criterion besides reaching a fitness value?
After a pre-defined number of generations/evaluations.
What is the first step in the evolutionary algorithm flowchart?
Randomly generate the population.
What do we do after evaluating the population in the evolutionary loop?
Select individuals to keep for the next generation.
What is elitism in the context of evolutionary algorithms?
Keeping a few parents in the new population.
What is the function used to evaluate fitness in Mastermind?
F(x) = p1 + 0.5 p2.
What are evolutionary strategies designed to optimize?
Real values in problems.
What is the main difference between genetic algorithms and evolutionary strategies?
Genotype: genetic algorithms use binary strings, evolutionary strategies use real values.
What does the μ + λ evolutionary strategy represent?
Maintains a steady population of μ + λ individuals.
What is the first step in the evolutionary strategy process?
Randomly generate a population of (μ + λ) individuals.
What do you do after generating the population?
Evaluate the population.
How many best individuals are selected as parents?
Select the μ best individuals from the population as parents (called x).
What is generated from the parents in the evolutionary strategy?
Generate λ offsprings (called y) from the parents.
What is the formula for generating offspring?
For each offspring, use the formula: \( y_i = x_j + \mathcal{N}(0, \sigma) \), where j is a random parent index in μ.
How is the population defined in the evolutionary strategy?
Population = union of parents and offspring: \( \text{population} = (\bigcup_{i=1}^{\lambda} y_i) \cup (\bigcup_{j=1}^{\mu} x_j) \).
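The (μ + λ) loop above can be sketched end to end: keep the μ best of the combined population, generate λ offspring by adding Gaussian noise to random parents, and repeat. An illustrative toy (the hyperparameter values and the example objective are assumptions).

```python
import random

def mu_plus_lambda_es(fitness, dim, mu=5, lam=10, sigma=0.3,
                      generations=200, seed=0):
    """(mu + lambda) evolution strategy sketch for maximising `fitness`.
    Offspring: y_i = x_j + N(0, sigma) noise on each dimension."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(dim)]
                  for _ in range(mu + lam)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)   # best first
        parents = population[:mu]                    # select mu parents
        offspring = [
            [x + rng.gauss(0.0, sigma) for x in rng.choice(parents)]
            for _ in range(lam)
        ]
        population = parents + offspring             # union: (mu + lambda)
    return max(population, key=fitness)

# e.g. maximise f(x) = -sum(x_i^2), whose optimum is the origin
best = mu_plus_lambda_es(lambda x: -sum(v * v for v in x), dim=2)
```

With a fixed σ the same trade-off discussed below applies: too large and the population struggles to refine, too small and it moves slowly.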
What is the main challenge in evolutionary strategies?
The main challenge comes in fixing the hyperparameter 𝜎.
What happens if 𝜎 is too large?
If 𝜎 is too large, the population moves quickly to the solution but struggles to refine it.
What happens if 𝜎 is too small?
If 𝜎 is too small, the population moves slowly and might be affected by local optima.
How can 𝜎 be adjusted over time?
Change 𝜎's value over time to adapt to the situation by adding sigma into the genotype.
What is the new genotype defined as?
Define another genotype as xj’ = {xj, σj} composed of the initial genotype and sigma value.
How is the new offspring's sigma calculated?
Calculate \( \sigma_i = \sigma_j \exp(\tau_0 \mathcal{N}(0, 1)) \).
What does the learning rate depend on?
The learning rate 𝜏0 is proportional to 1/√𝑛, where n is the number of dimensions of the genotype.
Why is substituting 𝜎 with 𝜏0 beneficial?
The selection of 𝜏0 is less critical than the value of 𝜎, allowing more flexibility in setting it.
What is a variant of evolutionary strategies?
CMA-ES algorithm, which evolves a covariance matrix.
What is an approach to genetic algorithms?
Discretise the parameters and use binary strings.
What is the goal of taking inspiration from natural evolution?
To find effective solutions for survival and adaptation in environments.
What is the purpose of novelty search?
To use novelty instead of fitness value to drive the search for optimality.
What does the novelty search algorithm focus on instead of fitness?
Novelty value
What is the purpose of the novelty archive?
To store all encountered solutions for novelty calculation
How is novelty calculated?
By summing distances to the k nearest neighbours in the archive (e.g., k = 3).
What does a larger novelty indicate?
More difference from previous solutions
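The novelty score just described can be sketched directly: sum the distances from a candidate's behavioural descriptor to its k nearest neighbours in the archive. A minimal illustration with Euclidean distance (the distance metric is an assumption; in practice it depends on the behavioural descriptor).

```python
def novelty(candidate, archive, k=3):
    """Novelty score: sum of distances to the k nearest neighbours in
    the archive of previously encountered solutions. A larger score
    means the candidate differs more from what was seen before."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    distances = sorted(dist(candidate, other) for other in archive)
    return sum(distances[:k])
```

A candidate far from everything in the archive scores much higher than one sitting inside a dense cluster of past solutions.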
What does the behavioral descriptor characterize?
Aspects of solutions and distances between them
Why is the behavioral descriptor task-specific?
It defines features to compare based on the task
What can happen if a feature is ignored in the behavioral descriptor?
Loss of potentially useful information
What is an example of a behavioral descriptor for a robot?
(x, y) coordinates of the robot's final position
What problem can a fitness-focused algorithm encounter?
Getting stuck in local minima
How does novelty search differ from traditional evolutionary algorithms?
It uses novelty score instead of fitness for evaluation
What is the goal of Quality-Diversity Optimization?
To learn diverse and high-performing solutions in one process
What does the concept of Quality-Diversity Optimization apply to?
Real-valued search space
What is a potential benefit of novelty search for a bipedal robot?
Leads to a more stable and successful robot
What is the goal of high-dimensional hyperspace exploration?
To find points that lead to the most interesting solutions.
What does the concept of behavioural descriptors help generate?
A collection of high-performing solutions with high diversity and performance.
How many degrees of freedom does the robot in the example have?
12 degrees of freedom (2 in each leg).
How many real-valued dimensions are there for the robot's movement?
36 real-valued dimensions.
What is the behavioural descriptor for the robot's movement?
Proportion of time each leg touches the ground (6 dimensions).
What is the goal of varying the proportions of time each leg spends touching the ground?
To find an optimal solution for walking as fast as possible.
How many ways to walk were found using the MAP-Elites algorithm?
Over 13,000 ways to walk.
What are the two main focuses of Quality-Diversity (QD) algorithms?
Measuring performance of solutions and distinguishing different types of solutions.
What is a fitness function used for in QD algorithms?
To measure the performance of solutions.
What does the behavioural descriptor characterize in QD algorithms?
It distinguishes different types of solutions.
What does Novelty Search with Local Competition optimize?
Two fitness functions: novelty score and local competition.
What is the concept of Local Competition in QD algorithms?
Comparing new solutions only with similar ones in the same categories.
What does LC(x) represent in Local Competition?
Number of solutions that x outperforms within its k nearest neighbours.
What happens when a better version of a solution is found in the archive?
The worse version is replaced by the better one.
What is the goal of MAP-Elites?
To discretise the behavioural descriptor space in a grid and fill it with the best solutions.
What does MAP-Elites stand for?
Multi-Dimensional Archive of Phenotypic Elites.
What is the main advantage of MAP-Elites?
Easy to implement and performs well in general.
What is a disadvantage of MAP-Elites?
Density of the solution is not always uniform.
How does MAP-Elites add new solutions?
If the cell is empty, the new solution is added; if occupied, the best fitness solution is kept.
What is the hyper-parameter in MAP-Elites?
Size of the cells (resolution of the grid).
What is the first step in the MAP-Elites process?
Randomly initialise some solutions to place in the grid.
What happens during the mutation operator in MAP-Elites?
Gaussian noise is added to some/all values of the selected solution.
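The MAP-Elites loop described in the cards above — random initialisation, uniform selection from the grid, Gaussian mutation, and keep-the-best-per-cell addition — can be sketched compactly. A toy illustration; the `evaluate` interface, cell size, and iteration counts are assumptions.

```python
import random

def map_elites(evaluate, dim, cell_size=0.5, iters=2000, seed=0):
    """MAP-Elites sketch. `evaluate` maps a genotype to
    (fitness, behavioural_descriptor). The grid keeps, per cell,
    the highest-fitness solution whose descriptor lands in that cell."""
    rng = random.Random(seed)
    archive = {}  # cell key -> (fitness, genotype)

    def add(genotype):
        fitness, bd = evaluate(genotype)
        cell = tuple(int(b // cell_size) for b in bd)  # discretise descriptor
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, genotype)        # keep best per cell

    for _ in range(100):                               # random initialisation
        add([rng.uniform(-1, 1) for _ in range(dim)])
    for _ in range(iters):                             # evolutionary loop
        _, parent = rng.choice(list(archive.values())) # uniform selection
        add([x + rng.gauss(0.0, 0.2) for x in parent]) # Gaussian mutation
    return archive
```

The archive size is the diversity metric, and the best fitness in the archive is one quality metric.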
What is a common metric for diversity in MAP-Elites?
Archive size (number of solutions stored in the collection).
What does the QD-score represent?
The sum of the fitness of all solutions in the archive.
What is the trade-off in QD algorithms represented by?
A Pareto-front to define the best variant of the algorithm.
What is the usual metric for performance in MAP-Elites?
Max or mean fitness value of all solutions.
What does the coverage refer to in MAP-Elites?
Number of filled cells, number of individuals, or % of filled cells in the grid.
What is the purpose of local competition in the algorithm?
To explore many different solutions in the entire space.
What is the addition mechanism in MAP-Elites?
It determines how new solutions are added to the grid based on their fitness.
What is a general framework in QD algorithms?
Allows use of different operators to define quality diversity algorithms for specific tasks.
What does the selector do in QD algorithms?
Selects the individual to be mutated and evaluated in the next generation.
What is the simplest selection method used in MAP-Elites?
Uniform random selection over the solutions in the container.
What are the criteria for proportional selection in QD?
Fitness, novelty, curiosity score.
How can solutions be stored in QD?
Discretised grid (like MAP-Elites) or unstructured archive (like Novelty Search).
What is a key feature of the unstructured archive in QD?
Maintains density instead of strict discretisation.
What is the process for using advanced mutations in QD?
Select multiple operators in stochastic selection, then apply cross-over before mutation.
What is the QD algorithm for teaching a robot to walk?
Unstructured archive + random uniform selector.
What is the behavioral descriptor for the walking robot?
X/Y coordinate position of the robot after 3 seconds.
What is the fitness measure for the walking robot?
Angular error at the end of the trajectory w.r.t. an ideal circular trajectory.
What is the QD algorithm for teaching a robot to push a cube?
MAP-Elites (grid + random uniform selector).
What is the behavioral descriptor for the cube-pushing robot?
Final position of the cube, where diversity is desired.
What is the fitness measure for the cube-pushing robot?
Energy efficiency of the movement.
What are genetic algorithms, evolutionary strategies, and evolutionary algorithms based on?
The same basic concepts.
What is clustering in unsupervised learning?
Dividing data into groups based on similarities, like dogs and cats.
What is dimensionality reduction?
Identifying important features in data, like enhancing a blurry image of a face.
What is reinforcement learning?
An algorithm interacts with the environment to produce a reward signal for improvement.
What is policy search in reinforcement learning?
Finding actions for an agent to maximize received rewards based on its state.
What is semi-supervised learning?
Some data have labels, some do not; aims to label unlabelled data using labelled items.
What is weakly-supervised learning?
Inexact output labels; e.g., indicating an item is somewhere in an image without precise location.
What is classification in machine learning?
Assigning discrete or categorical variables to inputs, like predicting actions in videos.
What is multi-label classification?
A classification task where multiple labels can be correct for a single input.
What is simple regression?
1 input variable and 1 output variable. E.g., size of a house predicts its price.
What is multiple regression?
Multiple input variables and 1 output variable. E.g., grade calculator with 3 inputs and 1 output (grade).
What is multivariate regression?
Multiple inputs to predict multiple outputs. E.g., predicting the location of an umbrella from a picture.
What is an example regression problem?
Given time as input, the regressor predicts the value at that time.
What characterizes a good predictor in regression?
The line is close to most points, even if it is off.
What characterizes a very good predictor in regression?
It predicts given points well but may struggle with unknown examples.
What is supervised learning?
Most common setting in ML problems, typically involves classification and regression.
How does Antoine classify shapes?
By placing data along 2 axes (colour and points) to create a classifier.
What is a linear classifier?
A classifier that uses a straight line to separate data into categories.
Why is selecting good features important?
Good features improve prediction accuracy; combining features is often better.
What are two ways to make predictions?
What is the goal of generating a model in supervised learning?
To approximate the true function using input data to predict outputs.
What is the training dataset defined as?
A sequence of pairs of input and output labels (Xn and yn).
What is feature encoding in supervised learning?
Transforming raw input observations into a modified version (feature space).
What is the purpose of the Xtest dataset?
To evaluate model performance on unseen data by comparing predicted outputs with ground truth.
What do we compute to measure model performance?
A score comparing predicted outputs with the ground truth/gold standard annotation.
What is the purpose of the truth/gold standard annotation?
To compute a score measuring model performance.
Why is it important to examine data before designing an algorithm?
It can provide clues for classifier design and help identify class label distribution.
What happens if class labels are imbalanced?
The algorithm may learn to identify only the majority class.
What is the curse of dimensionality?
As dimensions increase, data becomes sparse and training data may be noisy.
What is the Bag of Words method in NLP?
Logging the frequency of words without tracking their positions.
What is the modern approach to feature encoding in deep learning?
Letting the algorithm figure out optimal features from raw data.
What is a lazy learner?
Stores training examples and generalizes upon explicit request at test time.
What is an eager learner (the opposite of a lazy learner)?
Learns and generalises all it can before test time, resulting in quicker test time.
How does a nearest neighbour classifier work?
Looks at the nearest neighbour and classifies itself as the same.
What is a Linear Model?
Assumes the data is linearly separable, learning the best line to separate it.
What does a Linear Model classify?
Anything on the left as a green diamond, anything on the right as a red circle.
What is Feature Space Transformation?
Representing data differently to analyze and separate it more easily.
How do Neural Networks handle non-linear datasets?
Try to learn how to transform the feature space automatically.
What is the Bias-Variance trade-off?
A balance between overfitting (high variance) and underfitting (high bias).
What does MSE stand for?
Mean Squared Error, measures average square distance between correct and predicted outputs.
What is the Baseline in performance evaluation?
The lower bound for performance, often chance/random performance.
What is the Upper bound in performance evaluation?
The best case, often compared to human performance.
What are Decision Trees in ML?
Eager learners that process all data upfront and discard it after analysis.
What does the Nearest Neighbour Classifier do?
Classifies a test instance to the class label of the nearest training instance.
What does increasing k do to the classifier?
Makes the decision boundary smoother and less sensitive to training data
What is the curse of dimensionality in k-NN?
Distance metrics may not work well in high dimensional spaces
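The nearest-neighbour cards above can be illustrated with a tiny k-NN classifier: find the k closest training points and take a majority vote. A minimal sketch using Euclidean distance.

```python
from collections import Counter

def knn_predict(query, train_x, train_y, k=3):
    """k-NN classifier: majority vote among the k nearest training
    points (Euclidean distance). Larger k smooths the boundary."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    neighbours = sorted(zip(train_x, train_y),
                        key=lambda xy: dist(query, xy[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]
```

With k = 1 this reduces to the plain nearest-neighbour classifier.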
What is decision tree learning?
A method for approximating discrete classification functions using a tree-based representation.
What type of search do decision tree learning algorithms use?
Top-down greedy search through the space of possible solutions.
What is the first step in the general decision tree algorithm?
Search for the optimal splitting rule on training data.
What is the goal of finding an optimal split rule?
To create partitioned datasets that are more 'pure' than the original dataset.
What does Gini Impurity measure?
The probability of incorrectly classifying a randomly picked point according to class label distribution.
What is Variance Reduction mainly used for?
Regression trees where the target variable is continuous.
What is the formula for the amount of information required to determine the state of a random variable?
I(x) = log2(K).
What happens to information required when the impostor is more likely in one box?
Low entropy; less new information is gained.
What is the information required when the impostor is equally likely in 4 boxes?
I(x) = -log2(1/4) = 2 bits.
What does low entropy indicate?
You don’t need to know a lot of information to predict the value of a random variable.
What does high entropy indicate?
A lot of new information is gained when predicting the value of a random variable.
What is information gain?
Difference between initial entropy and weighted average entropy of subsets.
What is the binary tree information gain formula?
\( IG(\text{dataset}, \text{subsets}) = H(\text{dataset}) - \left( \frac{|S_{left}|}{|\text{dataset}|} H(S_{left}) + \frac{|S_{right}|}{|\text{dataset}|} H(S_{right}) \right) \)
What are categorical values in decision trees?
Search for the most informative feature, create branches for each value.
What is the formula for information gain for 'outlook'?
IG(D, outlook) = H(D) - (5/14 H(Dsunny) + 4/14 H(Dovercast) + 5/14 H(Drain))
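Entropy and information gain as used in the splits above can be computed in a few lines. A minimal sketch (label lists rather than full datasets) that works for any number of subsets, binary or categorical.

```python
import math

def entropy(labels):
    """Shannon entropy H = -sum_c p_c * log2(p_c) of a list of labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(labels, subsets):
    """IG = H(dataset) - weighted average entropy of the subsets,
    weighted by each subset's share of the dataset."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)
```

A perfectly balanced binary dataset has entropy 1 bit; a split into two pure subsets therefore has information gain 1.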
What is the first step in pruning a decision tree?
Go through each internal node connected only to leaf nodes.
What does a random forest consist of?
A collection of decision trees trained on different subsets of data.
What is the outcome of the algorithm in a random forest?
The majority vote by all the different trees.
How do you make predictions with regression trees?
By taking an average or weighted average of samples in the leaves.
What is the purpose of taking an average in machine learning predictions?
To make predictions based on the distance of different samples in the leaves of the tree.
What is the ultimate goal when creating machine learning systems?
To develop models that generalise to previously unseen examples.
Why is shuffling important before splitting a dataset?
To avoid implicit ordering in the dataset that can bias results.
What are hyperparameters in machine learning?
Model parameters chosen before training, such as 'k' in k-NN.
What is the motivation behind hyperparameter tuning?
To choose hyperparameter values that give the best performance.
What is a disadvantage of testing hyperparameters on the training dataset?
It usually does not generalise well to unseen examples.
What should never be done when evaluating hyperparameters?
Using the test dataset to select hyperparameters based on accuracy.
What is the correct approach for dataset splitting in machine learning?
Split into training, validation, and test sets, e.g., 60:20:20.
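The split just described — shuffle first, then carve out train/validation/test portions — can be sketched directly. A minimal illustration with the 60:20:20 ratio from the card.

```python
import random

def train_val_test_split(data, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle, then split into train/validation/test (e.g. 60:20:20).
    Shuffling first avoids bias from any implicit ordering in the data."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])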
What is hyperparameter tuning/optimisation?
Selecting parameters that produce the best classifier performance.
What can be done for final evaluation after hyperparameter tuning?
Optionally include the validation set back into the training set.
What can be included in the training set for final evaluation?
Validation set can be included to retrain the model on the whole dataset after finding best hyperparameters.
What is the purpose of including the validation set in training?
It provides more data for training, potentially increasing model performance.
What is a risk of developing and evaluating a model on the same data?
It results in overfitting the model to the training data.
What should the test set be used for?
The test set should only be used for estimating performance on unknown examples.
What is cross-validation used for?
Cross-validation is used when the dataset is small to ensure effective testing.
What are the steps in cross-validation?
Split the data into k equal folds; for each fold, train on the remaining k-1 folds and test on the held-out fold; average the k error estimates.
What does the global error estimate formula represent?
It averages performance metrics across all k held-out test sets.
What is important about cross-validation in model evaluation?
It evaluates an algorithm rather than a single trained instance of a model.
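The k-fold procedure and the averaged ("global") error estimate can be sketched as follows (the callback-style `train_and_error` interface is an assumption for illustration):

```python
def kfold_error(data, k, train_and_error):
    """Average the error measured on each of the k held-out folds."""
    fold_size = len(data) // k
    errors = []
    for i in range(k):
        held_out = data[i * fold_size:(i + 1) * fold_size]
        train_folds = data[:i * fold_size] + data[(i + 1) * fold_size:]
        errors.append(train_and_error(train_folds, held_out))
    return sum(errors) / k

# Dummy error function returning a fixed value, just to show the call shape:
estimate = kfold_error(list(range(10)), k=5, train_and_error=lambda tr, te: 0.2)
```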
What is one option for parameter tuning during cross-validation?
Use 1 fold for testing, 1 for validation, and k-2 for training in each iteration.
What is an alternative method for parameter tuning in cross-validation?
Cross-validation within cross-validation, separating 1 fold for testing.
How does the second option for parameter tuning help?
It allows for optimal hyperparameters to be found using more data.
What is the advantage of using different hyperparameters on each fold during cross-validation?
It likely leads to the best results for small data sets.
What is a disadvantage of using different hyperparameters on each fold?
It requires more work and experiments than simpler methods and is not practical in all situations due to high computation needs.
What is the advantage of testing on all data when going into production?
You can use all available data to train the model for better performance.
What is a disadvantage of testing on all data?
You cannot estimate the performance of the final trained model anymore; you rely on hyperparameters generalizing.
What are the steps in CASE 1 for plenty of data available?
Split into training, validation, and test sets; train models with different hyperparameters on the training set, select the best on the validation set, and report final performance on the test set.
What are the steps in CASE 2 for limited data available?
Use cross-validation: split the data into k folds, tune hyperparameters within the training portion of each iteration, and average performance over the held-out folds.
What does a confusion matrix represent?
It visualizes performance, showing true labels vs. predicted labels, allowing analysis of model performance.
What is precision in model evaluation?
Precision = TP / (TP + FP). It measures the correctness of positive predictions.
What does high precision indicate?
If a model predicts something as positive, it is likely to be correct.
What is recall in model evaluation?
Recall = TP / (TP + FN). It measures the ability to find all positive examples.
What does high recall indicate?
Good at retrieving positive examples, but may include false positives.
What is the trade-off between precision and recall?
High precision often leads to low recall and vice versa.
What is the formula for F1 score?
\( F1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \)
What is accuracy in classification?
Accuracy = \( \frac{Number \ of \ correctly \ classified \ examples}{Total \ number \ of \ examples} \)
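The four metrics above can be computed directly from the binary confusion-matrix counts (the example counts are made up for illustration):

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

p, r, f1, acc = metrics(tp=8, fp=2, fn=4, tn=6)
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))  # 0.8 0.67 0.73 0.7
```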
What is the difference between micro-averaging and macro-averaging?
Macro-averaging averages metrics at the class level; micro-averaging at the item level.
What is the effect of micro-averaging on precision, recall, and F1 in binary and multi-class classification?
They equal accuracy.
What are the five important model characteristics in ML?
Accurate, Fast, Scalable, Simple, Interpretable
What does a normalized confusion matrix achieve?
Calculates metrics as if evaluated on a balanced dataset
What is one view of system performance on a balanced test set?
The classifier's performance remains the same.
What should be evaluated for a more realistic scenario?
The system should be evaluated with data having a realistic distribution.
What does overfitting indicate about model performance?
Good performance on training data, but poor generalization to other data.
What does underfitting indicate about model performance?
Poor performance on both training and test data.
What happens to classification error as models learn?
Classification error decreases for training but may increase for test data.
What can cause overfitting?
A model that is too complex or training data that is not representative.
What is the impact of a small test set on accuracy?
An accuracy of 90% measured on only 10 samples is far less reliable than the same accuracy measured on a large test set.
What affects confidence in evaluation results?
The size of the test set affects confidence in evaluation results.
What is true error?
True error is the probability that the model misclassifies a randomly drawn example from a distribution.
What is sample error?
Sample error is the classification error based on a sample from the underlying distribution.
How is sample error mathematically defined?
Sample error is defined as: \( error_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x), h(x)) \), where \( \delta \) is 1 when \( f(x) \neq h(x) \) and 0 otherwise.
What is a confidence interval?
An N% confidence interval is an interval that is expected with probability N% to contain the parameter q.
What does a 95% confidence interval [0.2, 0.4] mean?
It means that with probability 95%, the true parameter q lies between 0.2 and 0.4.
How does sample size affect confidence intervals?
As sample size n increases, the margin of error shrinks towards 0, leading to narrower intervals.
What is the example confidence interval for errorS(h) = 0.22 with n = 50?
With n = 50 and z_N = 1.96, the interval for errorD(h) is approximately 0.22 ± 0.11, i.e. roughly [0.11, 0.33], which is quite wide.
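The interval from this example follows from the usual binomial approximation \( error_S(h) \pm z_N \sqrt{error_S(h)(1 - error_S(h))/n} \); a small sketch:

```python
import math

def error_confidence_interval(error_s, n, z=1.96):
    """Approximate confidence interval for the true error from the sample error."""
    margin = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - margin, error_s + margin

low, high = error_confidence_interval(0.22, 50)
print(round(low, 2), round(high, 2))  # 0.11 0.33

# A larger sample shrinks the interval:
low2, high2 = error_confidence_interval(0.22, 5000)
```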
What does statistical significance testing help determine?
Statistical significance testing helps determine if there is a difference between two distributions of classification errors.
What does a graph with overlapping distributions indicate?
Overlapping distributions indicate uncertainty about which classifier is better due to sampling error.
What is the Marek ApprovedTM test?
The Marek ApprovedTM test is the Randomisation test, considered intuitive for comparing algorithms.
What do statistical tests determine?
Statistical tests tell us if the means of two sets are significantly different.
How does the Randomisation test work?
It randomly switches predictions between two systems and measures if the performance difference is greater or equal to the original difference.
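A sketch of this paired randomisation test on per-example correctness scores (the two result lists are hypothetical; 1 means the system got that example right):

```python
import random

def randomisation_test(correct_a, correct_b, trials=10000, seed=0):
    """Approximate p-value: how often does randomly swapping the two systems'
    per-example results give a difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    count = 0
    for _ in range(trials):
        a_total, b_total = 0, 0
        for x, y in zip(correct_a, correct_b):
            if rng.random() < 0.5:
                x, y = y, x  # swap the two systems' results for this example
            a_total += x
            b_total += y
        if abs(a_total - b_total) >= observed:
            count += 1
    return count / trials

a = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
p = randomisation_test(a, b)
```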
What does a small p-value indicate?
A small p-value means we can be more confident that one system is different from the other.
What is the null hypothesis?
The null hypothesis states that the two algorithms/models perform the same and differences are due to sampling error.
What is the significance level for performance difference?
Performance difference is statistically significant if p < 0.05 (5%).
What is P-hacking?
P-hacking is the misuse of data analysis to find patterns that appear statistically significant without an underlying effect.
What happens if the number of experiments increases in P-hacking?
Increasing experiments can lead to a higher false discovery proportion, even if true discoveries remain the same.
What is the false positive rate in the example of P-hacking?
P(false positive) = 0.05, the same as the significance level.
What is the false discovery proportion in the initial example?
The false discovery proportion is 35 / 115 = 30%.
What happens to the false discovery proportion when experiments increase to 2400?
The false discovery proportion increases to 115 / 195 = 59% (2300 false hypotheses × 5% = 115 false positives, against the same 80 true discoveries).
What is a method to defend against unintentional p-hacking?
Adaptive threshold for calculating p-value (Benjamini & Hochberg, 1995)
What does the Benjamini-Hochberg critical value formula represent?
New significance threshold (critical value)
What is the downside of the Benjamini-Hochberg method?
Thresholds for most experiments will be lower than the original 5%
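A sketch of the Benjamini-Hochberg procedure, assuming the standard critical value \( \frac{i}{m} \cdot q \) for the i-th smallest of m p-values (the example p-values are made up):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return which hypotheses count as discoveries under the BH procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value is below its critical value
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank
    significant = [False] * m
    for idx in order[:k]:  # all hypotheses up to rank k are accepted
        significant[idx] = True
    return significant

print(benjamini_hochberg([0.001, 0.02, 0.04, 0.3]))  # [True, True, False, False]
```

Note how 0.04 is rejected even though it is below the original 0.05: its critical value is 3/4 × 0.05 = 0.0375, illustrating the lowered thresholds mentioned above.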
What are Artificial Neural Networks (ANNs)?
A class of ML algorithms optimized with gradient descent
Why has deep learning become more popular now?
Better conditions for implementation, like big data and faster hardware
What is backpropagation?
Described in 1974 by Werbos, it is a training algorithm for neural networks
What are LSTMs and CNNs?
Key components of modern neural network architectures described in the late '90s
What is a benefit of having large datasets for neural networks?
They improve training efficiency and effectiveness
What advancements have improved neural network training?
Better CPUs and GPUs for efficient computation
What has improved the accessibility and affordability of graphics cards?
Increased efficiency and reduced cost
What are automatic differentiation libraries used for?
They handle backpropagation and optimisation of model parameters.
What is linear regression useful for in machine learning?
It serves as a stepping stone towards neural network models
What do derivatives show in the context of linear regression?
How to change each parameter value to reduce loss
What is the learning rate in gradient descent?
The learning rate, denoted as 𝛼, is a hyperparameter that determines the size of the steps taken towards the minimum of the loss function.
What does 𝜕𝐸/𝜕𝑎 represent?
It represents the partial derivative of the loss function with respect to parameter 𝑎.
What is the formula for updating parameter 𝑎?
The update rule is: \( a_{new} := a_{old} - \alpha \frac{1}{N} \sum_{i=1}^{N} (a x^{(i)} + b - y^{(i)}) x^{(i)} \), where N is the total number of data points.
What does an epoch represent in machine learning?
An epoch is one complete pass over the entire dataset during training.
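The update rule and the notion of an epoch can be sketched for the 1-D linear model \( y \approx ax + b \) (function name, learning rate and toy data are illustrative; the gradients are those of the halved mean squared error):

```python
def fit_line(xs, ys, alpha=0.05, epochs=2000):
    """Fit y ~ a*x + b with batch gradient descent; one epoch = one full pass."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Partial derivatives of E = (1/2N) * sum((a*x + b - y)^2) w.r.t. a and b
        grad_a = sum((a * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((a * x + b - y) for x, y in zip(xs, ys)) / n
        a -= alpha * grad_a
        b -= alpha * grad_b
    return a, b

a, b = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])  # data generated by y = 2x + 1
```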
What is the gradient in vector notation?
The gradient is the vector of all partial derivatives for a function with K parameters: \( \nabla_{\theta} f(\theta) = [\frac{\partial f(\theta)}{\partial \theta_1}, \frac{\partial f(\theta)}{\partial \theta_2}, ..., \frac{\partial f(\theta)}{\partial \theta_K}] \).
What is the analytical solution for linear regression?
The analytical solution allows finding optimal parameters without iterating through epochs by solving a specific equation.
What is the complexity of matrix inversion?
Matrix inversion has cubic complexity, making it computationally expensive for large problems.
What is multiple linear regression?
Multiple linear regression uses multiple input features, each with its own parameter, to predict an output value.
How does the RMSE change with multiple features?
The RMSE (Root Mean Square Error) is typically lower with multiple features due to increased information for prediction.
What is RMSE in model evaluation?
Root Mean Square Error (RMSE) measures the differences between predicted and observed values; lower RMSE indicates better model accuracy.
How does using more features affect model predictions?
Using more features provides more information, leading to more accurate predictions in the model.
What does a linear regression model represent in higher dimensions?
In higher dimensions, the linear regression model is a continuous linear plane representing the learned data.
What is the role of the nucleus in a biological neuron?
The nucleus acts like the neuron's brain, telling it what to do.
What do dendrites do in a biological neuron?
Dendrites connect to other neurons and receive signals from them.
What happens when a biological neuron's axon fires?
When conditions are right, the axon fires a signal to connect with other neurons' dendrites.
What are input features in an artificial neuron?
Input features (xi) are the values fed into the artificial neuron, each with an associated weight (θi).
What determines the importance of a feature in an artificial neuron?
The weight (θi) associated with each input feature determines its importance in the artificial neuron.
What does the output of an artificial neuron involve?
The output involves multiplying features and weights, and adding the bias (b).
What is the activation function in an artificial neuron?
The activation function (g) transforms the output of the linear equation into a new value.
How can the bias term be included in the equation?
The bias term can be included by reformulating the equation to add an extra feature and weight for the bias.
What is the vector notation for input features and weights?
Input features and weights can be represented as vectors: x = [x1, x2, ..., xK], W = [θ1, θ2, ..., θK].
What is the logistic activation function used for?
The logistic function (sigmoid) squashes any value into a range between 0 and 1.
What does logistic regression actually do?
Logistic regression performs binary classification using the logistic function, not actual regression.
How is the logistic regression model optimized?
The logistic regression model is optimized using gradient descent.
What is a perceptron?
A perceptron is an algorithm for supervised binary classification, an early version of an artificial neuron.
What activation function does a perceptron use?
A perceptron uses a threshold function as its activation function, outputting 0 until a certain limit is reached.
What does the perceptron use as its activation function?
A threshold function that outputs 0 until a limit (θ) is reached, then outputs 1.
What happens when y = 1 and h(x) = 0?
Weight θi is increased if xi is positive, decreased if negative.
What happens when y = 0 and h(x) = 1?
We want to decrease the summation, so we do the opposite to reduce WT x.
Why can't a perceptron learn XOR?
XOR is not linearly separable; one linear line cannot separate the classes.
What is a multi-layer perceptron (MLP)?
A network that connects neurons in sequence to learn higher order features.
What is the role of hidden layers in a neural network?
They process features and are not visible from the outside.
What is the first and last layer of a neural network called?
The first layer is the input layer and the last is the output layer.
What should you check when something isn’t working in a neural network?
Ensure that the matrix dimensions match.
What is b in the context of a neural network layer?
The layer-specific bias vector, unique to each neuron in a layer.
What was the approach to feature crafting before multi-layer networks?
Manually crafting features for pattern recognition.
What is the benefit of training both feature extraction and classification layers together?
They optimize each other based on data.
What happens if we only use linear activation functions in a multi-layer network?
It becomes equivalent to a single-layer network.
What happens when a two-layer network uses linear activation?
It collapses into a single-layer network, unable to capture complex non-linear patterns.
What do non-linear activation functions do?
They allow models to learn complicated patterns by breaking the dependency of multiple layers collapsing into one.
What is the range of the sigmoid activation function?
The sigmoid function compresses output into the range between 0 and 1.
What is the range of the tanh activation function?
The tanh function maps input values to the range -1 to 1.
What is the formula for the tanh activation function?
f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
What characterizes the ReLU activation function?
ReLU is linear and unbounded in the positive part, but non-linear overall.
What does the softmax activation function do?
It scales inputs into a probability distribution that sums to 1.
What is a common activation function for deep neural networks?
ReLU is commonly used in very deep neural networks, especially for image recognition.
What is a potential issue with using ReLU?
ReLU can produce unbounded activations, which can destabilise training in the network.
What should you try first when designing models?
Experiment with tanh and sigmoid first, as they are bounded.
How should the choice of activation function in hidden layers be treated?
It is a hyperparameter that can be set empirically or optimized using a development set.
How can we set hyperparameters for activation functions?
Empirically or using a development set to find the best performing function for the model and dataset.
What activation function is commonly used for binary classification?
Sigmoid is most common; tanh can also be used.
What activation function should be used for predicting unbounded scores?
Use a linear activation function.
What activation function is most commonly used for predicting a probability distribution?
Softmax is used for multi-class classification.
How many neurons are in the hidden layer of the PyTorch neural network?
There are 5 neurons in the hidden layer.
What activation function is applied in the hidden layer during the forward pass?
Tanh is used as the activation function.
What is the purpose of the loss function in neural networks?
To minimize and show performance on a specific task.
How do we update parameters in neural networks?
Using gradient descent to minimize the loss function.
What is the formula for updating parameters in gradient descent?
\( \theta_i^{(t+1)} = \theta_i^{(t)} - \alpha \frac{\partial E}{\partial \theta_i^{(t)}} \)
What is the formula for Mean Squared Error (MSE)?
\( MSE = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 \)
What is the primary goal of classification tasks?
To choose between different categories or discrete options.
What is multi-class classification?
Classification with more than 1 class, where each input belongs to exactly 1 class.
What do we want to maximize in classification?
The likelihood of the network assigning correct labels.
What is the probability formulation for binary classification?
\( \prod_{i=1}^{N} (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{(1-y^{(i)})} \)
What happens if the network assigns the correct label for every data point?
The product approaches 1.
What is the issue with multiplying probabilities in classification?
It can lead to underflow errors.
How can we avoid underflow errors in classification?
By maximizing the logarithm of the probability formula.
What is the formula for binary cross-entropy loss?
\( -\sum_{i=1}^{N} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})] \)
What is the normalised form of binary cross-entropy loss?
\( L = -\frac{1}{N} \sum_{i=1}^{N} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})] \)
What does normalizing by the number of data points do in loss calculation?
It makes the loss magnitude independent of the number of data points.
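A small sketch of the normalised binary cross-entropy (the toy labels and predictions are made up; predictions must lie strictly in (0, 1) so the logarithms are defined):

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Mean binary cross-entropy over N predictions in (0, 1)."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

loss = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
```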
What is the formula for categorical cross-entropy loss?
\( L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_c(i) \log(\hat{y}_c(i)) \)
In categorical cross-entropy, what does y_c represent?
1 if C is the correct class for data point i, 0 otherwise.
What is the output layer configuration for a multi-class classification neural network example?
An output layer with 3 neurons predicting probabilities over 3 flower types.
What is batching in neural networks?
Combining vectors of several data points into a matrix for simultaneous processing.
Why is batching beneficial for training neural networks?
It increases speed and reduces noise, leveraging GPU efficiency.
How does batching assist in regularization during optimization?
It combines updates from several data points easily.
What is the benefit of batching in neural networks?
Combines updates from several datapoints, making updates more stable and accurate.
What is the input matrix X in a neural network?
A batch of data points with dimensions n × k, where n is the number of data points and k is the number of features.
What does the first layer in a neural network apply?
A linear transformation using a weight matrix and adding a bias.
What is Z in the context of a neural network?
The output matrix after applying the weight matrix and bias in the first layer.
What do we get after applying the activation function to Z?
A, the output of the first hidden layer.
What is the purpose of calculating loss in a neural network?
To determine how well the model performs.
What method is used to update model parameters in neural networks?
Gradient descent is used to update weight matrices and biases.
What is backpropagation in neural networks?
A method to calculate necessary partial derivatives iteratively.
How does backpropagation simplify calculations?
It breaks down calculations into smaller steps, moving backwards through the network.
What is the chain rule used for in neural networks?
To calculate the derivative of a composite function.
What does the chain rule formula represent?
It shows how to break down derivatives into smaller parts for easier calculation.
How can we find the partial derivative of the loss with respect to W[1]?
By breaking it down through Z[1] and A[1] using their respective derivatives.
What are the two types of partial derivatives in backpropagation?
The output of an activation function w.r.t its input and the output of a linear transformation w.r.t its input.
What is the purpose of the partial derivative in backpropagation?
To update the weights of the linear transformation in the neural network.
What does the partial derivative of a matrix w.r.t another matrix represent?
A 4-D tensor containing the partial derivatives of every element in the first matrix w.r.t every element in the second.
What is the linear transformation notation used in backpropagation?
Z = XW, where Z is the output, X is input, and W is weights.
What do you need to calculate to update weights in a linear transformation?
The partial derivative of the loss w.r.t the weights and the bias vector.
What is the shape of the partial derivative of a scalar w.r.t a matrix?
It has the same shape as the original matrix itself.
What is the key component in the derivatives during backpropagation?
The partial derivative of the loss w.r.t the output of the linear transformation.
What does backpropagation iteratively calculate?
Partial derivatives, taking them from the top layers and passing them down.
What is the bias vector used for in backpropagation?
It is broadcast (repeated) across the batch so that the same bias is added to every data point's input vector.
What is necessary for lower levels to calculate their own partial derivatives?
The gradient of the loss w.r.t the input and the weight's partial derivative.
What is the significance of the dimensions N, D, and M in backpropagation?
N is the number of data points in the batch, D the input dimensionality, and M the number of output neurons.
What happens during the forward pass in a neural network?
The operation takes X and W as inputs and produces output Z.
What does the partial derivative of the loss w.r.t one element depend on?
It depends on the weights it multiplies with and the loss of whatever uses this element.
How many output values does the particular element affect?
It affects exactly 3 output values: z1,1, z1,2, and z1,3.
What is the equation for the partial derivative of the element?
The equation uses the chain rule and involves the weight w1,1 and the partial derivative of z1,1 w.r.t x1,1.
What happens when you calculate the partial derivative w.r.t the full matrix X?
It can be expressed as a dot product of two matrices.
What do the two matrices in the dot product represent?
The first is the partial derivative of the loss w.r.t Z, and the second is the transposed weight matrix for the layer.
What is the importance of backpropagation for inputs X?
It is a simple way of calculating backpropagation for inputs in a given layer.
How do we calculate the partial derivative w.r.t the weights?
By breaking it down for one individual weight, considering its effect on the output.
What does one weight affect in the output?
One weight affects two values in the output for two data points in the batch.
What is the equation for the partial derivative of the loss w.r.t the weights?
It is the dot product of the transposed feature matrix \( X^T \) and the partial derivative of the loss w.r.t. Z: \( \frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Z} \).
What do we need to calculate for the bias vector?
The partial derivative of the loss w.r.t the bias vector.
What result do we get for the partial derivative of the loss w.r.t the bias?
It is equal to a transposed column vector of 1s times \( \frac{\partial L}{\partial Z} \), i.e. the column-wise sum of \( \frac{\partial L}{\partial Z} \) over the batch.
What is needed to perform full backpropagation through the neural network?
How to handle the activation functions.
What is the purpose of activation functions in a neural network?
Activation functions are applied element-wise to introduce non-linearity, allowing the network to learn complex patterns.
Do activation functions have parameters that need updating during training?
No, activation functions generally do not have parameters that need to be updated during training.
What is the derivative of an activation function denoted as?
The derivative of an activation function is denoted as g′(x).
What does the chain rule help with in backpropagation?
The chain rule helps calculate the partial derivative of the loss with respect to the inputs of the activation function.
What is the formula for the Sigmoid activation function?
For Sigmoid: g(z) = 1/(1 + e^(-z)), g′(z) = g(z)(1 - g(z)).
What is the formula for the Tanh activation function?
For Tanh: g(z) = (e^z - e^(-z))/(e^z + e^(-z)), g′(z) = 1 - g(z)².
What is the ReLU activation function and its derivative?
For ReLU: g(z) = z for z > 0, 0 for z ≤ 0; g′(z) = 1 for z > 0, 0 for z ≤ 0.
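The three activation functions and their derivatives from the cards above can be written out and numerically cross-checked with a two-sided difference:

```python
import math

def sigmoid(z):   return 1 / (1 + math.exp(-z))
def d_sigmoid(z): return sigmoid(z) * (1 - sigmoid(z))

def tanh(z):      return math.tanh(z)
def d_tanh(z):    return 1 - math.tanh(z) ** 2

def relu(z):      return z if z > 0 else 0.0
def d_relu(z):    return 1.0 if z > 0 else 0.0

# Two-sided numerical check that each derivative matches its function:
eps = 1e-6
for g, dg in [(sigmoid, d_sigmoid), (tanh, d_tanh), (relu, d_relu)]:
    numeric = (g(0.5 + eps) - g(0.5 - eps)) / (2 * eps)
    assert abs(numeric - dg(0.5)) < 1e-6
```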
How is Softmax different from other activation functions?
Softmax takes a whole vector as input and outputs a whole vector, unlike other activation functions applied element-wise.
What is the purpose of combining Softmax with cross-entropy?
Combining Softmax with cross-entropy simplifies the backpropagation of derivatives for classification tasks.
What does the joint partial derivative through Softmax and cross-entropy represent?
It represents the predictions minus the true class labels, normalized by N if applicable.
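The softmax plus cross-entropy gradient \( \hat{y} - y \) can be sketched as follows (the max-subtraction trick for numerical stability is a standard implementation detail, not from the cards; example logits are made up):

```python
import math

def softmax(z):
    """Scale a vector of scores into a probability distribution summing to 1."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_xent_grad(logits, one_hot):
    """Joint gradient of cross-entropy through softmax w.r.t. the logits."""
    y_hat = softmax(logits)
    return [p - t for p, t in zip(y_hat, one_hot)]  # predictions minus labels

probs = softmax([2.0, 1.0, 0.1])
grad = softmax_xent_grad([2.0, 1.0, 0.1], [1, 0, 0])
```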
What is gradient descent?
Gradient descent is an optimization algorithm that updates parameters by taking small steps in the negative direction of the gradient.
What is the formula for updating weights in gradient descent?
W_new = W_old - α * (∂L/∂W), where α is the learning rate.
What is the learning rate in gradient descent?
The learning rate (α) is a hyperparameter that determines the step size for updating model parameters.
What does α represent in gradient descent?
Learning rate/step size, a hyperparameter based on the development set.
What must be true for gradients to be computed in neural networks?
Network functions and the loss need to be differentiable.
What is the termination condition in gradient descent?
When the loss function does not improve anymore.
What is a common issue when updating weights during backpropagation?
Updating weights before all gradients have been computed with the original weights can cause errors.
What is Stochastic Gradient Descent (SGD)?
Calculating the gradient based on one data point and updating weights immediately.
What are the steps in Stochastic Gradient Descent?
Shuffle the data; for each data point, compute the gradient of the loss for that point and update the weights immediately; repeat for several epochs.
What is Mini-batched Gradient Descent?
A balance between batch and stochastic gradient descent, using batches of data points.
What are the steps in Mini-batched Gradient Descent?
Split the data into small batches; for each batch, compute the gradient of the loss over the batch and update the weights; repeat for several epochs.
What is a challenge in optimising neural networks?
Finding the lowest point on complex loss surfaces is difficult.
Why is the learning rate important?
The size of the learning rate significantly affects the training process.
What happens if the learning rate is too low?
Optimization can take a very long time to reach a good minimum.
What is the ideal state of the learning rate?
It allows reaching the minimum of the loss function in a reasonable number of steps.
What happens if a parameter has not been updated for a while?
The learning rate for that parameter may be increased.
What happens if a parameter is making big updates?
The learning rate for that parameter may be decreased.
What is the intuition behind learning rate decay?
Take smaller steps as we approach the minimum to avoid overshooting.
When can learning rate decay be performed?
Every epoch, after a certain number of epochs, or when validation performance doesn't improve.
Why should we not set all weights to zero?
Neurons will learn the same things, leading to the same optimized values.
What is a common method for weight initialization?
Drawing randomly from a normal distribution with mean 0 and variance 1 or 0.1.
What does Xavier Glorot initialization do?
Draws values from a uniform distribution based on the number of neurons in layers.
What is the formula used in Xavier Glorot initialization?
Weights are drawn from a uniform distribution on \( [-\frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}] \), where \( n_{in} \) and \( n_{out} \) are the numbers of neurons in the previous and current layer.
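A sketch of Glorot uniform initialisation, assuming the standard limit \( \sqrt{6 / (n_{in} + n_{out})} \) (function name and seed are illustrative):

```python
import random

def glorot_uniform(n_in, n_out, seed=0):
    """Draw an n_in x n_out weight matrix from U(-limit, limit),
    with limit = sqrt(6 / (n_in + n_out))."""
    rng = random.Random(seed)
    limit = (6 / (n_in + n_out)) ** 0.5
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

W = glorot_uniform(4, 5)  # weights for a layer mapping 4 inputs to 5 neurons
```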
What role does randomness play in neural networks?
Weight initialisation, data shuffling and similar steps are random, so different random initialisations lead to different results and performance.
What is the solution to controlling randomness in neural networks?
Explicitly set the random seed for all random number generators used.
What can happen when processes are parallelised on GPUs?
They can produce randomly different results due to different threads running at different times.
How should you report model performance under different random seeds?
Report the mean and standard deviation of the performance.
What is min-max normalisation?
Scaling the smallest value to a and the largest to b, e.g., [0, 1] or [-1, 1].
What is standardisation (z-normalisation)?
Scaling the data to have mean 0 and standard deviation 1.
Why is normalisation important in neural networks?
It helps weight updates to be proportional to the input, improving model learning accuracy.
What should you remember about normalisation for data columns?
Normalise each column separately, not the entire matrix.
How should normalising constants be calculated?
Calculate them based only on the training set and apply them to test/evaluation sets.
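A sketch of per-column z-normalisation where the constants are fit on the training data only and then applied to any split (function names are illustrative):

```python
def zscore_fit(column):
    """Compute mean and standard deviation from the training column only."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    return mean, var ** 0.5

def zscore_apply(column, mean, std):
    """Apply the training-set constants to any split (train, val or test)."""
    return [(x - mean) / std for x in column]

train_col = [1.0, 2.0, 3.0, 4.0]
mean, std = zscore_fit(train_col)
train_norm = zscore_apply(train_col, mean, std)
test_norm = zscore_apply([5.0], mean, std)  # test uses the SAME constants
```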
What is gradient checking?
A method to verify if the gradient is calculated correctly in the implementation.
What are the two methods to isolate the gradient?
Calculating it analytically with backpropagation, and estimating it numerically by perturbing each parameter slightly (finite differences).
What is the definition of a partial derivative?
The partial derivative of L(w) with respect to w is defined as: \( \frac{\partial L(w)}{\partial w} = \lim_{\epsilon \to 0} \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon} \)
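Gradient checking with this two-sided definition is a few lines; the example loss \( L(w) = w^2 \) is made up for illustration:

```python
def numerical_gradient(loss, w, eps=1e-5):
    """Two-sided numerical estimate of dL/dw, for comparison with backprop."""
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

# Example: L(w) = w^2 has analytical gradient 2w.
analytic = 2 * 3.0
numeric = numerical_gradient(lambda w: w * w, 3.0)
# If the two values are not close, suspect a bug in the gradient code.
```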
What indicates a bug in neural network training?
If the values from different methods of calculating partial derivatives are not similar, it indicates a bug.
What is overfitting in neural networks?
Overfitting occurs when a model learns the training data too well, failing to generalize to unseen data.
How can overfitting be prevented?
To prevent overfitting, use held-out validation and test sets to measure generalization performance.
What is network capacity?
Network capacity refers to the number of parameters in a model and its ability to overfit the dataset.
What does it mean if a model is underfitting?
Underfitting means the model performs poorly on both training and validation sets due to insufficient capacity.
How can you improve a model that is underfitting?
Increase the number of neurons, parameters, or layers in the model to improve learning.
What indicates a model is overfitting?
Overfitting is indicated by good performance on the training set but poor performance on the validation set.
What is one method to prevent overfitting?
Limit the number of parameters in the model to prevent memorization of the dataset.
What is the best solution to overfitting?
The best solution to overfitting is to acquire more data for training.
What is early stopping in neural network training?
Early stopping is a method where training is halted when performance on the validation set does not improve for a set number of epochs.
What is regularization in the context of neural networks?
Regularization adds constraints to the model to prevent overfitting, such as penalizing large weights.
What are L2 and L1 regularization?
L2 regularization adds squared weights to the loss function, while L1 regularization adds absolute weights, both helping to control model complexity.
What does L2 regularization do to weights?
L2 regularization penalizes larger weights more, encouraging sharing between features and pushing weights towards 0.
What does L2 regularisation do?
Adds squared weights to the loss function, penalising larger weights more and encouraging sharing between features.
What is the role of the hyperparameter λ in L2 regularisation?
Controls the importance of regularisation, usually set to a low value (e.g., 0.001).
What does L1 regularisation do?
Adds the absolute value of weights to the loss function, using the sign of the weight for updates.
How does L1 regularisation affect weight updates?
The update rule is: \( w \leftarrow w - \alpha (\frac{\partial Loss}{\partial w} + \lambda \, sign(w)) \)
How do L1 and L2 regularisation differ in weight management?
L2 pushes all weights towards 0, while L1 encourages sparsity, keeping many weights at 0.
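The L1/L2 updates in the cards above can be sketched as a single gradient step. A minimal NumPy illustration (the learning rate, λ, and weight values are arbitrary choices for the example):

```python
import numpy as np

def regularized_step(w, grad, alpha=0.1, lam=0.01, kind="l2"):
    """One gradient step with an L1 or L2 penalty on the weights.

    w, grad: weight vector and loss gradient; alpha: learning rate;
    lam: regularisation strength (lambda).
    """
    if kind == "l2":
        # L2 adds lambda * w to the gradient, shrinking all weights towards 0.
        return w - alpha * (grad + lam * w)
    # L1 adds lambda * sign(w), pushing small weights exactly to 0 (sparsity).
    return w - alpha * (grad + lam * np.sign(w))

w = np.array([1.0, -2.0, 0.5])
g = np.zeros(3)  # a zero loss gradient isolates the penalty's effect
w_l2 = regularized_step(w, g, kind="l2")
w_l1 = regularized_step(w, g, kind="l1")
```

With a zero loss gradient, the L2 step shrinks every weight proportionally, while the L1 step moves each weight by the same fixed amount towards 0, which is why L1 tends to produce exact zeros (sparsity).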
What is dropout in neural networks?
A method to reduce overfitting by randomly setting some neural activations to 0 during training.
What percentage of neurons are typically dropped during training with dropout?
About 50% of neurons are typically dropped at each forward pass during training.
What happens during testing when using dropout?
All neurons are used, but inputs are scaled to match training expectations.
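The dropout behaviour described above — drop at training time, scale at test time — can be sketched as follows (the drop rate and activation values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.5, training=True):
    """Classic dropout: during training, zero each activation with
    probability p; at test time, keep every neuron but scale by (1 - p)
    so the expected input to the next layer matches training."""
    if training:
        mask = rng.random(a.shape) >= p  # keep each unit with probability 1-p
        return a * mask
    return a * (1.0 - p)

a = np.ones(1000)                      # toy activations
train_out = dropout(a, training=True)  # roughly half the units zeroed
test_out = dropout(a, training=False)  # all units kept, scaled by 0.5
```

Modern frameworks usually implement "inverted dropout" instead, scaling by 1/(1 − p) during training so the test-time forward pass needs no change; the expected activations are the same either way.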
What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data, while unsupervised learning uses only feature values without labels.
What is the objective of unsupervised learning?
To find hidden structures in the dataset without ground-truth labels.
What is unsupervised learning?
A type of learning where the dataset consists only of feature values without ground-truth labels.
What is the objective of unsupervised learning?
To find hidden structures in the dataset for making inferences or decisions.
What is clustering in unsupervised learning?
The task of finding groups ('clusters') of samples that might belong to the same class.
What is density estimation?
Finding the probability of seeing a point in a certain location compared to another location.
What is dimensionality reduction?
A process to reduce the number of features while retaining important information.
What does clustering imply about intra-cluster variance?
There is low intra-cluster variance among instances in the same cluster.
What are the steps of the k-means algorithm?
Initialisation, Assignment, Update, and checking for convergence.
What is a cluster in clustering?
A set of instances that are similar to each other and dissimilar to instances in other clusters.
How does clustering help in vector quantization?
It improves encoding by clustering information in a datastream to reduce data size.
What is an example of using clustering in nature?
Identifying different species of flowers by plotting features like petal length vs. sepal width.
What is the structure of an unsupervised learning task?
A feature space with datapoints lacking additional information like labels or values.
What does k represent in k-means clustering?
The number of clusters, e.g., k = 3 means there are 3 centroids.
What is the first step in the k-means algorithm?
Initialisation: Select k random instances or generate random vectors for centroids.
What is the goal of the assignment step in k-means?
Assign every point in the dataset to the nearest centroid.
How do we update centroids in k-means?
By computing the average position of all points in each cluster.
What is checked during the convergence step in k-means?
The displacement of centroids; if it's larger than a threshold, loop back to assignment.
What is the formula for the assignment step in k-means?
\( \forall i \in \{1, \dots, N\}: \; c(i) = \text{argmin}_{k \in \{1, \dots, K\}} \| x^{(i)} - \boldsymbol{\mu}_k \|^2 \)
What does the update formula in k-means compute?
The average location for all samples assigned to cluster k.
What condition indicates convergence in k-means?
If \( \forall k: \; \| \boldsymbol{\mu}_k^t - \boldsymbol{\mu}_k^{t-1} \| < \epsilon \).
What is checked in Step 4 of K-means?
Convergence by computing the movement of centroids between timesteps.
What indicates to stop iterating in K-means?
If the movement of centroids is lower than a certain threshold (𝜖).
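The four k-means steps in these cards (initialisation, assignment, update, convergence check) can be sketched in a few lines of NumPy; the two well-separated blobs are a hypothetical dataset for the example:

```python
import numpy as np

def kmeans(X, k, eps=1e-6, seed=0):
    """Minimal k-means: initialise, assign, update, repeat until the
    centroids move less than eps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # init: k random instances
    while True:
        # Assignment: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points.
        new_mu = np.array([X[c == j].mean(axis=0) for j in range(k)])
        # Convergence: stop once every centroid's displacement is below eps.
        if np.all(np.linalg.norm(new_mu - mu, axis=1) < eps):
            return new_mu, c
        mu = new_mu

# Hypothetical data: two well-separated blobs around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
mu, c = kmeans(X, k=2)
```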
How is K-means viewed as a model?
As a model optimization problem with centroid locations and data point assignments.
What is the objective of K-means?
Minimize the loss function L for assignments of data points to centroids.
What does the loss function L represent?
The mean distance between samples and their associated centroid.
What is the significance of K in K-means?
K is a crucial hyperparameter that affects the clustering results.
What is the Elbow Method used for?
To determine the optimal value of K by plotting loss values against K.
What should be selected according to the Elbow Method?
The value of K where the rate of decrease in loss sharply shifts.
What does cross-validation help determine?
The best value for hyperparameters using a validation set.
What is a significant weakness of K-means?
The need to define K, which significantly impacts results.
What is a weakness of K-means regarding its results?
It only finds a local optimum and is sensitive to initial centroid positions.
When is K-means applicable?
When a distance function exists on the dataset, typically with real values.
How does K-medioid algorithm differ from K-means?
It is less sensitive to outliers by using the geometric median.
What shape must clusters have for K-means to work effectively?
Clusters must be hyper-ellipsoids (or hyper-spheres).
What is the objective of density estimation algorithms?
To estimate the probability density function p(x) from data.
What does a Probability Density Function (PDF) model?
The likelihood of a continuous variable being observed within an interval.
What is the goal of generative models in relation to probability?
To model the distribution of a class as p(X | y).
What do discriminative models directly model?
The probability of observing label y given sample values X, p(y | X).
What activation function transforms neural network output into a probability distribution?
Softmax activation.
What does the Softmax activation do?
Transforms the output of the neural network into a probability distribution.
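A minimal softmax sketch (the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    """Map raw network outputs (logits) to a probability distribution;
    subtracting the max first is a standard numerical-stability trick."""
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))  # sums to 1, largest logit wins
```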
What is Bayes’ rule used for in generative models?
To turn the generative model into a discriminative classifier.
What do non-parametric methods assume about function shape?
They make no assumptions about the form/shape of the function.
What is the bias and variance characteristic of non-parametric methods?
Low bias; high variance depending on the data.
What do histograms do in density estimation?
Group data into bins, count occurrences, and normalize.
What does normalization ensure in histograms?
The integral of the function sums to 1, making it a valid PDF.
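The histogram steps above — bin, count, normalise so the integral is 1 — can be sketched as follows (the sample size and bin count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 10_000)  # hypothetical 1-D dataset

# Group the data into bins and count occurrences.
counts, edges = np.histogram(data, bins=50)
widths = np.diff(edges)

# Normalise so the estimate integrates to 1, making it a valid PDF.
density = counts / (counts.sum() * widths)
integral = float(np.sum(density * widths))
```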
What is Kernel Density Estimation?
Estimates the density of a function by using a kernel around training examples.
What does the kernel function do in density estimation?
Computes the difference with the current point x and normalizes according to bandwidth.
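A minimal kernel density estimation sketch with a Gaussian kernel (the bandwidth h = 0.3 and the N(0, 1) sample are illustrative choices):

```python
import numpy as np

def kde(x, samples, h=0.3):
    """Gaussian kernel density estimate at x: centre a kernel on every
    training sample, compute the difference with x, and normalise by the
    bandwidth h."""
    u = (x - samples) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return float(k.mean() / h)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 5000)  # hypothetical training set from N(0, 1)
p_mode = kde(0.0, samples)  # near the true density 1/sqrt(2*pi) ≈ 0.399
p_tail = kde(4.0, samples)  # far in the tail, close to 0
```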
What are the characteristics of parametric approaches?
Make assumptions about the shape, inducing bias but fixing the number of parameters.
What is ensured by the normalization factor in Gaussian distribution?
The integral of the distribution sums to 1.
What is the purpose of the normalization term in the Multivariate Gaussian Distribution?
To ensure the double-integral sums to 1.
What does likelihood determine in a model?
How good the model is at capturing the probability of generating data x.
What do we multiply to get the likelihood in a dataset?
The predicted values from the models for every sample with parameters θ.
Why do we calculate negative log-likelihood instead of likelihood?
To turn maximization into minimization, similar to training a neural network.
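The negative log-likelihood idea can be sketched for a Gaussian model (the data and parameter values are illustrative):

```python
import numpy as np

def gaussian_nll(data, mu, sigma):
    """Negative log-likelihood of data under N(mu, sigma^2). Summing log
    densities (rather than multiplying densities) avoids numerical
    underflow, and negating turns maximisation into minimisation."""
    log_p = -0.5 * np.log(2 * np.pi * sigma ** 2) - (data - mu) ** 2 / (2 * sigma ** 2)
    return float(-log_p.sum())

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, 1000)
nll_good = gaussian_nll(data, mu=2.0, sigma=1.0)  # parameters close to the truth
nll_bad = gaussian_nll(data, mu=0.0, sigma=1.0)   # wrong mean
```

Lower NLL means a better fit, so the parameters closest to those that generated the data win.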
Is the Gaussian distribution sufficient for modeling densities in all cases?
No, it may not be satisfactory for all data distributions.
What is the problem with fitting a Gaussian distribution to bimodal data?
It induces bias and may not capture the data's characteristics.
What is a potential solution to the limitations of Gaussian distributions?
Using mixture models to capture different modes of the distribution.
What does the Gaussian Mixture Model (GMM) estimate?
The probability density with p(x) from multiple Gaussian distributions.
What does GMM ensure about the PDF?
The GMM ensures that the PDF integrates to 1, even if it is a mixture of multiple PDFs.
What algorithm is used to fit GMM to training examples?
The Expectation Maximisation (EM) algorithm is used.
What are the two main steps of the EM algorithm?
The two main steps are the E-step (expectation) and the M-step (maximisation).
What is done in the E-step of the EM algorithm?
Responsibilities for each training example and each mixture component are computed.
How is the responsibility calculated in the E-step?
Using the formula: \( r_{ik} = \frac{\pi_k \mathcal{N}(x^{(i)} | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x^{(i)} | \mu_j, \Sigma_j)} \)
What is updated in the M-step of the EM algorithm?
GMM parameters are updated using the computed responsibilities.
How is the mean updated in the M-step?
The mean is updated using: \( \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} r_{ik} x^{(i)} \)
What is checked for convergence in the EM algorithm?
Convergence is checked by monitoring changes in parameters or log likelihood.
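The E-step/M-step loop in these cards can be sketched for a 1-D, two-component GMM. Initialising the means from the data extremes is a simplification for the example; as noted later in this deck, real implementations often initialise from K-means centroids:

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=100):
    """Minimal EM for a 1-D Gaussian mixture model."""
    pi = np.full(k, 1.0 / k)                 # mixing proportions
    mu = np.linspace(x.min(), x.max(), k)    # spread initial means over the data
    var = np.full(k, x.var())                # shared initial variance
    for _ in range(iters):
        # E-step: responsibility r_ik of component k for sample i.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return pi, mu, var

# Hypothetical bimodal data: two modes at -4 and +4.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 500), rng.normal(4, 1, 500)])
pi, mu, var = em_gmm_1d(x)
```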
What is the Bayesian Information Criterion (BIC)?
BIC is used to select the number of components K in GMM.
What does \( \mathcal{L}(K) \) represent in the BIC formula?
\( \mathcal{L}(K) \) is the negative log likelihood.
What is the penalty term in the BIC formula?
The penalty term is \( \frac{P_k}{2} \log(N) \), which penalizes complex models.
How many parameters are needed for covariance in 2D Gaussian?
3 parameters for the covariance (symmetric 2x2 matrix).
What is the purpose of the -1 in the parameter count?
It accounts for the constraint that the sum of mixing proportions must equal 1.
What happens to BIC values as K increases?
BIC values decrease sharply then rise again due to penalty dominance.
What are the steps in cross-validation for GMM-EM?
Split the data into training and validation sets, fit a GMM with EM for each candidate K on the training set, and select the K with the best validation log-likelihood.
What is a key similarity between GMM and K-means?
Both require selecting the most appropriate K value for clusters/components.
What does convergence mean in GMM and K-means?
Convergence occurs when changes in parameters are sufficiently small.
How does GMM initialization relate to K-means?
GMM means are often initialized from K-means centroid locations.
What is soft clustering in GMM?
Every point belongs to several clusters with varying degrees of membership.
What distance metric is used in GMM?
Distance is related to Mahalanobis distance, encoded by the covariance matrix.
What is a Genetic/Evolutionary Algorithm?
An optimisation method for black box functions without knowing the mathematical equation or gradient, inspired by natural evolution and genetics.
What is reinforcement learning?
Learning to maximize a numerical reward, considered an optimization problem.
What do policy search algorithms deal with?
Continuous search spaces, represented as \( x^* = \text{argmax}_x f(x) \).
What are Black-Box Optimisation Algorithms?
Algorithms where the links between parameters are unknown at the start of training.
What is an example of black-box optimisation in robotics?
The unknown relationship between speed and joint movements.
What are the four main concepts of Darwin's theory?
Variation, heredity, selection, and time (evolution over many generations).
What are the three main families of genetic algorithms proposed in the 60s?
Genetic algorithms (Holland), evolution strategies (Rechenberg and Schwefel), and evolutionary programming (Fogel).
What is the main concept of genetic/evolutionary algorithms?
They have a population of solutions encoding genotypes, which are developed into phenotypes for evaluation.
What happens to the worst-performing solutions in genetic algorithms?
They are removed (killed), and crossover and mutation are applied to the better-performing solutions.
What is done to better-performing solutions?
Crossover and mutation are applied to generate new solutions (offspring).
What is the result of repeating the evolutionary process?
The solution converges to an optimal high-performing solution.
How is each solution represented in evolutionary algorithms?
Each solution is represented by a genotype.
What term is used to describe the blurred lines between algorithm families?
Evolutionary Algorithms.
What does p1 represent in the Mastermind fitness function?
The number of pieces with the right color and correct position.
What does p2 represent in the Mastermind fitness function?
The number of pieces with the right color but wrong position.
What is the goal of evolutionary algorithms regarding fitness functions?
Maximize the fitness function.
What is the fitness function for teaching a robot to walk?
F(x) = walking speed = travelled distance after a few seconds.
What is the fitness function for teaching a robot to throw an object?
F(x) = −distance(object, target), so that minimising the distance to the target maximises the fitness.
How is the phenotype created from the genotype in the Mastermind game?
Aggregate bits 3 by 3, each trio becomes an integer.
What do integers correspond to in the Mastermind game?
Different colours: (0=red, 1=yellow, 2=green, 3=blue…).
What is done with invalid genotypes in the Mastermind game?
Assigned the lowest fitness value to reduce survival chance.
What is the purpose of selection operators in evolutionary algorithms?
Select parents for the next generation.
How does the biased roulette wheel process work?
Individuals are selected based on their fitness proportion.
What is the first step in the biased roulette wheel process?
Compute the probability pi to select an individual.
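The biased roulette wheel can be sketched as sampling with fitness-proportionate probabilities (the fitness values are illustrative):

```python
import numpy as np

def roulette_select(fitness, n, seed=0):
    """Biased roulette wheel: individual i is picked with probability
    p_i = fitness_i / sum(fitness)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(fitness, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=n, p=p)

fit = [1.0, 1.0, 8.0]                 # third individual owns 80% of the wheel
picks = roulette_select(fit, n=10_000)
share = float(np.mean(picks == 2))    # empirical selection frequency
```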
What is elitism in evolutionary algorithms?
Keeping a fraction of the best individuals in the new generation.
How is standard mutation on binary strings performed?
Randomly generate a number for each bit; if lower than probability m, mutate.
What is the first step in standard mutation on binary strings?
Randomly generate a number between 0 and 1 for each bit of the genotype.
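Standard bit-flip mutation can be sketched as follows (the mutation rate m and genotype length are arbitrary):

```python
import random

def bitflip_mutation(genotype, m=0.05, rng=random.Random(0)):
    """Standard mutation on a binary string: draw a number in [0, 1) for each
    bit and flip that bit when the draw is below the mutation rate m."""
    return [1 - b if rng.random() < m else b for b in genotype]

parent = [0] * 100
child = bitflip_mutation(parent, m=0.05)
flips = sum(child)  # with m = 0.05 we expect roughly 5 flipped bits
```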
What is the purpose of the specific mutation in the Mastermind problem?
To swap groups of 3 bits in the genotype with probability m2.
What is a common stopping criterion for evolutionary algorithms?
When a specific fitness value is reached.
What is another stopping criterion besides reaching a fitness value?
After a pre-defined number of generations/evaluations.
What do we do after evaluating the population in the evolutionary loop?
Select individuals to keep for the next generation.
What is elitism in the context of evolutionary algorithms?
Keeping a few parents in the new population.
What is the main difference between genetic algorithms and evolutionary strategies?
Genotype: genetic algorithms use binary strings, evolutionary strategies use real values.
What does the μ + λ evolutionary strategy represent?
Maintains a steady population of μ + λ individuals.
What is the first step in the μ + λ evolutionary strategy?
Randomly generate a population of (μ + λ) individuals.
What is the selection process in the μ + λ strategy?
Select the μ best individuals from the population as parents.
How many best individuals are selected as parents?
Select the μ best individuals from the population as parents (called x).
What is generated from the parents in the evolutionary strategy?
Generate λ offspring (called y) from the parents.
What is the formula for generating offspring?
For each offspring: \( y_i = x_j + \mathcal{N}(0, \sigma) \), where j is a random parent index in \( \{1, \dots, \mu\} \).
How is the population defined in the evolutionary strategy?
Population = union of parents and offspring: \( \text{population} = \left( \bigcup_{i=1}^{\lambda} y_i \right) \cup \left( \bigcup_{j=1}^{\mu} x_j \right) \).
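The (μ + λ) loop in these cards can be sketched on a toy objective. The sphere function and all hyperparameter values are illustrative, and this version minimises f rather than maximising a fitness:

```python
import numpy as np

def mu_plus_lambda(f, n_dim, mu=5, lam=20, sigma=0.3, gens=200, seed=0):
    """(mu + lambda) evolution strategy minimising f: select the mu best,
    create lam offspring by adding Gaussian noise N(0, sigma) to a random
    parent, then form the next population from parents union offspring."""
    rng = np.random.default_rng(seed)
    pop = rng.normal(0.0, 1.0, (mu + lam, n_dim))
    for _ in range(gens):
        parents = pop[np.argsort([f(x) for x in pop])[:mu]]   # mu best individuals
        noise = rng.normal(0.0, sigma, (lam, n_dim))          # N(0, sigma) mutation
        offspring = parents[rng.integers(0, mu, lam)] + noise
        pop = np.vstack([parents, offspring])                 # parents ∪ offspring
    return min(pop, key=f)

sphere = lambda x: float(np.sum(x ** 2))  # toy objective, minimum at the origin
best = mu_plus_lambda(sphere, n_dim=3)
```

Because the parents survive into the next population, the best solution found so far is never lost (implicit elitism).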
What is the main challenge in evolutionary strategies?
The main challenge comes in fixing the hyperparameter 𝜎.
What happens if 𝜎 is too large?
If 𝜎 is too large, the population moves quickly to the solution but struggles to refine it.
What happens if 𝜎 is too small?
If 𝜎 is too small, the population moves slowly and might be affected by local optima.
How can 𝜎 be adjusted over time?
Change 𝜎's value over time to adapt to the situation by adding sigma into the genotype.
What is the new genotype defined as?
Define another genotype as xj’ = {xj, σj} composed of the initial genotype and sigma value.
What does the learning rate depend on?
The learning rate 𝜏0 is proportional to 1/√𝑛, where n is the number of dimensions of the genotype.
Why is substituting 𝜎 with 𝜏0 beneficial?
The selection of 𝜏0 is less critical than the value of 𝜎, allowing more flexibility in setting it.
What is the goal of taking inspiration from natural evolution?
To find effective solutions for survival and adaptation in environments.
What is the purpose of novelty search?
To use novelty instead of fitness value to drive the search for optimality.
What is the purpose of the novelty archive?
To store all encountered solutions for novelty calculation
What can happen if a feature is ignored in the behavioral descriptor?
Loss of potentially useful information
What is an example of a behavioral descriptor for a robot?
(x, y) coordinates of the robot's final position
How does novelty search differ from traditional evolutionary algorithms?
It uses novelty score instead of fitness for evaluation
What is the goal of Quality-Diversity Optimization?
To learn diverse and high-performing solutions in one process
What is a potential benefit of novelty search for a bipedal robot?
Searching for novel behaviours can discover a more stable and successful walking gait than directly optimising walking distance.
What is the goal of high-dimensional hyperspace exploration?
To find points that lead to the most interesting solutions.
What does the concept of behavioural descriptors help generate?
A collection of high-performing solutions with high diversity and performance.
How many degrees of freedom does the robot in the example have?
12 degrees of freedom (2 in each leg).
What is the behavioural descriptor for the robot's movement?
Proportion of time each leg touches the ground (6 dimensions).
What is the goal of varying the proportions of time each leg spends touching the ground?
To find an optimal solution for walking as fast as possible.
What are the two main focuses of Quality-Diversity (QD) algorithms?
Measuring performance of solutions and distinguishing different types of solutions.
What does the behavioural descriptor characterize in QD algorithms?
It distinguishes different types of solutions.
What does Novelty Search with Local Competition optimize?
Two fitness functions: novelty score and local competition.
What is the concept of Local Competition in QD algorithms?
Comparing new solutions only with similar ones in the same categories.
What does LC(x) represent in Local Competition?
Number of solutions that x outperforms within its k nearest neighbours.
What happens when a better version of a solution is found in the archive?
The worse version is replaced by the better one.
What is the goal of MAP-Elites?
To discretise the behavioural descriptor space in a grid and fill it with the best solutions.
How does MAP-Elites add new solutions?
If the cell is empty, the new solution is added; if occupied, the best fitness solution is kept.
What is the first step in the MAP-Elites process?
Randomly initialise some solutions to place in the grid.
What happens during the mutation operator in MAP-Elites?
Gaussian noise is added to some/all values of the selected solution.
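The MAP-Elites loop described in these cards (random initialisation, uniform selection, Gaussian mutation, grid addition rule) can be sketched on a toy task; the genotype, behavioural descriptor, and fitness below are all hypothetical choices for illustration:

```python
import numpy as np

def map_elites(evals=5000, grid=10, seed=0):
    """Minimal MAP-Elites sketch. Toy task: genotype = 2-D point in [0,1]^2,
    behavioural descriptor = the genotype binned into a grid x grid map,
    fitness = negative distance to the centre (0.5, 0.5)."""
    rng = np.random.default_rng(seed)
    archive = {}  # cell -> (fitness, genotype)

    def try_add(x):
        cell = tuple(np.minimum((x * grid).astype(int), grid - 1))  # discretise BD
        fit = -float(np.linalg.norm(x - 0.5))
        # Empty cell: add; occupied cell: keep the solution with better fitness.
        if cell not in archive or fit > archive[cell][0]:
            archive[cell] = (fit, x)

    for x in rng.random((100, 2)):  # 1. random initialisation
        try_add(x)
    for _ in range(evals):
        keys = list(archive)
        parent = archive[keys[rng.integers(len(keys))]][1]  # 2. uniform selection
        child = np.clip(parent + rng.normal(0.0, 0.1, 2), 0.0, 1.0)  # 3. mutation
        try_add(child)  # 4. addition rule
    return archive

archive = map_elites()
coverage = len(archive) / 100                   # fraction of filled cells
best_fit = max(f for f, _ in archive.values())  # elite nearest the centre
```

`coverage` and `best_fit` correspond to the diversity and quality metrics discussed in the following cards.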
What is a common metric for diversity in MAP-Elites?
Archive size (number of solutions stored in the collection).
What is the trade-off in QD algorithms represented by?
A Pareto-front to define the best variant of the algorithm.
What does the coverage refer to in MAP-Elites?
Number of filled cells, number of individuals, or % of filled cells in the grid.
What is the purpose of local competition in the algorithm?
To explore many different solutions in the entire space.
What is the addition mechanism in MAP-Elites?
It determines how new solutions are added to the grid based on their fitness.
What is a general framework in QD algorithms?
Allows use of different operators to define quality diversity algorithms for specific tasks.
What does the selector do in QD algorithms?
Selects the individual to be mutated and evaluated in the next generation.
What is the simplest selection method used in MAP-Elites?
Uniform random selection over the solutions in the container.
How can solutions be stored in QD?
Discretised grid (like MAP-Elites) or unstructured archive (like Novelty Search).
What is a key feature of the unstructured archive in QD?
Maintains density instead of strict discretisation.
What is the process for using advanced mutations in QD?
Select multiple operators in stochastic selection, then apply cross-over before mutation.
What is the QD algorithm for teaching a robot to walk?
Unstructured archive + random uniform selector.
What is the behavioral descriptor for the walking robot?
X/Y coordinate position of the robot after 3 seconds.
What is the fitness measure for the walking robot?
Angular error at the end of the trajectory w.r.t. an ideal circular trajectory.
What is the QD algorithm for teaching a robot to push a cube?
MAP-Elites (grid + random uniform selector).
What is the behavioral descriptor for the cube-pushing robot?
Final position of the cube, where diversity is desired.
What are genetic algorithms, evolutionary strategies, and evolutionary algorithms based on?
The same basic concepts.