Data Science is the practice of analyzing and interpreting complex data to derive insights and make informed decisions, often using machine learning and statistical techniques.
Machine Learning is a subset of AI where algorithms learn patterns from data to make decisions or predictions without being explicitly programmed.
A neural network is a series of algorithms that mimic the operations of the human brain to recognize patterns, often used in deep learning models.
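As a minimal sketch of this idea, here is a one-hidden-layer forward pass in NumPy (the layer sizes, the weight names `W1`/`W2`, and the ReLU/sigmoid activation choices are illustrative, not canonical):

```python
import numpy as np

# One hidden layer: 3 inputs -> 4 hidden units (ReLU) -> 1 output (sigmoid).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input-to-hidden weights
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))   # hidden-to-output weights
b2 = np.zeros(1)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)            # ReLU activation
    return 1 / (1 + np.exp(-(h @ W2 + b2)))   # sigmoid squashes output into (0, 1)

y = forward(np.array([0.5, -1.0, 2.0]))
```

Training would then adjust `W1`, `W2`, `b1`, `b2` by backpropagating an error signal; only the forward pass is shown here.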
Supervised learning is a type of machine learning where the model is trained on labeled data, meaning both inputs and outputs are known.
Unsupervised learning involves training models on data that has no labeled responses, allowing the model to find hidden structures or patterns.
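For instance, k-means clustering (assuming scikit-learn is available; the toy points below are made up for illustration) finds group structure without ever seeing a label:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs; no labels are given to the model.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_   # cluster assignment discovered from the data alone
```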
Reinforcement learning is an area of ML where agents learn to make decisions by performing actions in an environment and receiving rewards or penalties.
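A tabular Q-learning sketch makes the reward-driven loop concrete. The chain environment, the hyperparameters, and the `step` helper below are all invented for illustration:

```python
import numpy as np

# Tiny chain world: states 0..3, actions 0 = left, 1 = right.
# Reaching state 3 ends the episode with reward 1; every other step gives 0.
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)

for _ in range(500):                              # episodes
    s = int(rng.integers(n_states - 1))           # random non-terminal start
    while s != n_states - 1:
        # epsilon-greedy: mostly exploit the current Q, sometimes explore
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s2, r = step(s, a)
        # Q-learning update: nudge Q[s, a] toward r + gamma * max_a' Q[s2, a']
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

policy = Q.argmax(axis=1)   # greedy policy after learning: move right in 0..2
```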
A decision tree is a supervised learning algorithm used for classification and regression tasks, where data is split into branches based on feature values.
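A short scikit-learn sketch of such a split (the toy data is contrived so that the label depends only on the second feature):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy labeled data: label is 1 exactly when the second feature exceeds 0.5.
X = [[0.1, 0.2], [0.4, 0.9], [0.8, 0.1], [0.9, 0.8], [0.3, 0.6], [0.7, 0.3]]
y = [0, 1, 0, 1, 1, 0]

# The tree learns the branching rule (here, a threshold on feature 1) from data.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
pred = clf.predict([[0.5, 0.95], [0.5, 0.05]])
```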
Random Forest is an ensemble method that creates multiple decision trees and aggregates their results to improve accuracy and reduce overfitting.
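A sketch of the ensemble idea, assuming scikit-learn (the synthetic dataset and 100-tree setting are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees, each fit on a bootstrap sample of the training data;
# the forest's prediction is the majority vote across trees.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)   # held-out accuracy of the aggregated ensemble
```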
Overfitting can be reduced with techniques like cross-validation, regularization (L1 and L2 penalties), pruning decision trees, and reducing the complexity of the model.
A confusion matrix is a table used to evaluate classification models by showing the counts of true positives, false positives, true negatives, and false negatives.
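A small worked example with scikit-learn (the label vectors are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
```

Here there are 3 true negatives, 1 false positive, 1 false negative, and 3 true positives.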
Gradient Descent is an optimization algorithm used to minimize a cost function by iteratively adjusting parameters in the direction of the steepest descent.
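The iteration is easy to see on a one-parameter cost function (the quadratic and the learning rate below are chosen purely for illustration):

```python
# Minimize f(x) = (x - 3)^2, whose gradient is f'(x) = 2 * (x - 3).
x, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (x - 3)
    x -= lr * grad   # step opposite the gradient, i.e. down the steepest slope

# x approaches the minimizer x = 3
```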
A Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression that finds the hyperplane best separating data points of different classes.
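A minimal linear-kernel sketch with scikit-learn (the separable toy points are invented):

```python
from sklearn.svm import SVC

# Linearly separable toy data: class depends on the sign of x + y.
X = [[-2, -1], [-1, -2], [-1, -1], [1, 2], [2, 1], [1, 1]]
y = [0, 0, 0, 1, 1, 1]

# SVC finds the maximum-margin hyperplane between the two classes.
clf = SVC(kernel="linear").fit(X, y)
pred = clf.predict([[-3, -3], [3, 3]])
```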
The bias-variance tradeoff refers to the balance between a model’s ability to generalize and its accuracy on training data. High bias leads to underfitting, while high variance leads to overfitting.
Logistic Regression is a classification algorithm used when the target variable is binary. It models the probability that an instance belongs to a class using a logistic function.
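A one-feature sketch, assuming scikit-learn (the data is contrived so class 1 corresponds to larger feature values):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
# predict_proba applies the logistic function to the learned linear score,
# giving P(class 1 | x) rather than a hard label.
p = model.predict_proba([[4.0]])[0, 1]
```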
Regularization is a technique to penalize large model coefficients to prevent overfitting, commonly applied in models like Ridge (L2) and Lasso (L1) regression.
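The contrast shows up directly in the fitted coefficients. In this synthetic sketch (the penalty strengths and the one-informative-feature setup are arbitrary choices), Ridge shrinks every coefficient while Lasso zeroes out the irrelevant ones:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first feature actually matters; the rest are noise.
y = X @ np.array([3.0, 0.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: can set coefficients exactly to zero
```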
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms data into a set of uncorrelated components, capturing the maximum variance with fewer variables.
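A sketch on synthetic 2-D data whose variance lies mostly along one direction (the data-generation scheme is invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two correlated columns: almost all variance lies along a single direction.
t = rng.normal(size=(200, 1))
X = np.hstack([t, t + 0.1 * rng.normal(size=(200, 1))])

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_   # fraction of variance per component
```

Because the columns are nearly collinear, the first principal component alone captures almost all of the variance, so one component could replace both original variables.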
A shallow neural network has one or two hidden layers, while a deep network has many hidden layers, allowing it to model more complex relationships.
Cross-validation is a technique to assess a model’s performance by splitting the data into training and test sets multiple times and averaging the results.
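For example, 5-fold cross-validation with scikit-learn (the Iris dataset and logistic-regression model are just convenient stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, score on the held-out fold, repeat 5 times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_acc = scores.mean()   # averaged estimate of generalization accuracy
```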
A ROC curve is a graphical representation of a classifier’s performance, plotting the true positive rate against the false positive rate at different thresholds.
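The scores and labels below are made up to show the mechanics with scikit-learn:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]

# (fpr, tpr) pairs trace the ROC curve as the decision threshold varies.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)   # area under that curve
```

The AUC summarizes the curve in one number: the probability that a randomly chosen positive is scored above a randomly chosen negative.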
Recurrent Neural Networks (RNNs) are neural networks designed to handle sequential data by maintaining a ‘memory’ of previous inputs, making them well suited to time-series forecasting and natural language processing.
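The 'memory' is just a hidden state carried across time steps. A minimal recurrent cell in NumPy (the sizes, weight names `W_x`/`W_h`, and tanh activation are illustrative; real RNNs also learn these weights rather than fixing them randomly):

```python
import numpy as np

# Minimal recurrent cell: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b).
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W_x = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

def run(sequence):
    h = np.zeros(n_hidden)        # hidden state: the network's 'memory'
    for x in sequence:            # the same weights are reused at every step
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h                      # final state summarizes the whole sequence

h_final = run(rng.normal(size=(4, n_in)))   # a sequence of 4 input vectors
```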