Machine Learning

The first rule for implementing something with ML or blockchain…
is to figure out if you can implement it without ML or blockchain.

How does Machine Learning work?

UC Berkeley breaks out the learning system of a machine learning algorithm into three main parts.

  • A Decision Process: In general, machine learning algorithms are used to make a prediction or classification. Based on some input data, which can be labeled or unlabeled, your algorithm will produce an estimate about a pattern in the data.
  • An Error Function: An error function evaluates the prediction of the model. If there are known examples, an error function can make a comparison to assess the accuracy of the model.
  • A Model Optimization Process: If the model can fit better to the data points in the training set, then weights are adjusted to reduce the discrepancy between the known example and the model estimate. The algorithm will repeat this iterative "evaluate and optimize" process, updating weights autonomously until a threshold of accuracy has been met.
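The three parts above can be sketched on a toy problem. This is a minimal illustration, not UC Berkeley's own formulation: the data (drawn from y = 2x + 1), the learning rate, and the stopping threshold are all assumptions chosen for the example.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def predict(w, b, x):
    """Decision process: produce an estimate from the input."""
    return w * x + b

def mse(w, b):
    """Error function: mean squared error against the known examples."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit(lr=0.05, tol=1e-6, max_steps=10_000):
    """Optimization: adjust the weights until the error is below a threshold."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(max_steps):
        if mse(w, b) < tol:
            break
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit()
print(f"w={w:.3f}, b={b:.3f}, error={mse(w, b):.2e}")
```

The loop is the "evaluate and optimize" cycle: each pass measures the error, then nudges the weights in the direction that reduces it.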

Top machine learning algorithms every data enthusiast should know

1. Linear Regression
Type: Supervised
Use: Predicting continuous values (e.g., stock prices)
Explanation: Finds the relationship between input and output variables by fitting a straight line. Best for simple, linear relationships.

2. Logistic Regression
Type: Supervised
Use: Classification problems (e.g., spam vs. not spam)
Explanation: Used to model binary outcomes by fitting data to a sigmoid curve, outputting probabilities.

3. Decision Trees
Type: Supervised
Use: Classification & regression
Explanation: Splits data into branches to make decisions based on conditions. Offers easy interpretability but may overfit without tuning.

4. Random Forest
Type: Supervised
Use: Classification & regression
Explanation: Combines multiple decision trees for robust predictions, reducing overfitting. Ideal for handling complex data with noise.

5. Support Vector Machines (SVM)
Type: Supervised
Use: Classification
Explanation: Finds the hyperplane that best separates data points into classes. Works well for high-dimensional data.

6. K-Nearest Neighbors (KNN)
Type: Supervised
Use: Classification
Explanation: Classifies based on the majority class among nearest neighbors. Best for low-dimensional, well-labeled data.

7. Naïve Bayes
Type: Supervised
Use: Text classification, spam detection
Explanation: Uses probability for predictions, assuming feature independence. Often effective with text and sentiment analysis.

8. K-Means Clustering
Type: Unsupervised
Use: Grouping data (e.g., customer segmentation)
Explanation: Clusters data points around centroids, used to find patterns without labeled data.

9. Principal Component Analysis (PCA)
Type: Unsupervised
Use: Reducing data dimensions

10. Neural Networks
Type: Supervised/Unsupervised
Use: Complex tasks like image & language processing
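As one concrete instance from the list above, K-Nearest Neighbors is small enough to sketch in a few lines. The training points, labels, and the helper name `knn_predict` are hypothetical, and the sketch uses plain Euclidean distance rather than any particular library's implementation.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled 2-D points forming two well-separated groups.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.4, 6.6)))
```

There is no training step: the "model" is the stored data, which is one reason KNN tends to suit small, low-dimensional datasets.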

Top machine learning algorithms


From Data to Decisions: Unpacking the AI Pipeline

This session falls into the category of Machine Learning, specifically focusing on the Data Science and Model Development aspects of AI lifecycle management. By understanding the AI process, we can appreciate how data is transformed into intelligent decisions. From collecting and preprocessing data to training models and deploying them, each step plays a vital role in making AI systems effective and reliable.

1. Data Collection: The AI process begins with gathering data from different sources. This can include information like numbers, text, images, or videos. The data acts as the building blocks for AI systems, helping them learn and make decisions. Think of it as the raw material we need to work with!

2. Data Preprocessing: Once we have the data, we need to clean and organize it. This step involves removing any errors, duplicates, or irrelevant parts. We also make sure the data is in a format that the AI algorithms can understand. It's like tidying up the data so that it's ready for analysis!

3. Feature Extraction: Now, we need to extract the most important parts of the data. These are called features, and they help the AI algorithms understand what's significant in the data. It's like highlighting the essential details that will guide the AI system's decision-making process.

4. Model Training: Next, we feed the extracted features into AI models. These models are like intelligent algorithms that learn from the data. We train them by repeatedly showing them examples and helping them adjust their settings to make accurate predictions or decisions. It's like teaching a model to recognize patterns or make judgments based on what it has learned!

5. Model Evaluation: Once the model has been trained, we need to check how well it performs. We use evaluation metrics to measure its accuracy or effectiveness. This step helps us ensure that the model is reliable and provides valuable insights. It's like testing the model to make sure it's doing a good job!

6. Deployment and Inference: After training and evaluation, we put the model to work in the real world. We integrate it into systems or applications where it can process new, unseen data and provide predictions or decisions. It's like unleashing the power of the trained model to make practical use of its intelligence!

7. Continuous Monitoring and Improvement: AI is an ongoing process. We regularly monitor the model's performance, collect feedback, and update it as needed. This ensures that the AI system remains accurate and aligned with the desired outcomes. It's like taking care of the model and making improvements to keep it at its best!
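Steps 1 to 6 above can be compressed into a toy script. Everything here is a hypothetical stand-in: the messages, the two hand-picked features, and the nearest-centroid classifier were chosen only to keep the sketch dependency-free, and a real pipeline would use far more data and a proper library.

```python
# 1. Data collection: hypothetical raw records (text, label).
raw = [
    ("win cash now!!!", "spam"), ("win cash now!!!", "spam"),  # duplicate row
    ("free prize!!", "spam"), ("claim your free cash!", "spam"),
    ("lunch at noon?", "ham"), ("see you tomorrow", "ham"),
    ("meeting notes attached", "ham"), (None, "ham"),          # missing text
]

# 2. Preprocessing: drop missing rows and exact duplicates, keeping order.
clean = list(dict.fromkeys(r for r in raw if r[0] is not None))

# 3. Feature extraction: exclamation marks and a count of trigger words.
def features(text):
    return (text.count("!"), sum(w in text for w in ("free", "cash", "win", "prize")))

# 4. Model training: nearest-centroid classifier (one mean vector per class).
def train(rows):
    sums = {}
    for text, label in rows:
        f = features(text)
        total, n = sums.get(label, ((0, 0), 0))
        sums[label] = ((total[0] + f[0], total[1] + f[1]), n + 1)
    return {label: (t[0] / n, t[1] / n) for label, (t, n) in sums.items()}

def predict(model, text):
    f = features(text)
    return min(model, key=lambda c: (f[0] - model[c][0]) ** 2 + (f[1] - model[c][1]) ** 2)

# 5. Evaluation: hold out one row per class and measure accuracy on them.
train_rows = [clean[0], clean[1], clean[3], clean[4]]
test_rows = [clean[2], clean[5]]
model = train(train_rows)
accuracy = sum(predict(model, t) == y for t, y in test_rows) / len(test_rows)

# 6. Inference: apply the trained model to new, unseen input.
print(accuracy, predict(model, "free cash for you!"))
```

Step 7 (monitoring) has no counterpart here; in practice it means re-running the evaluation on fresh data and retraining when the score drifts.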

Machine Learning steps


Difference between ML, AI and DL

The terms "AI," "machine learning" and "deep learning" are often used interchangeably, but they don't mean the same thing. Here's a breakdown of how they differ.

AI
(Neural Networks (1950s–1970s) / Gen AI (present))
Artificial intelligence studies how computers mimic the functions of natural intelligence. The term was coined in 1956, and includes everything from machine learning to cybernetics, machine ethics and more.

Machine Learning
(1980s–2010s)
Machine learning can be split into unsupervised and supervised learning. In unsupervised machine learning, algorithms attempt to structure unlabeled data in meaningful ways and uncover hidden patterns, for example, through clustering. In supervised learning, algorithms learn to make predictions from a training dataset of labeled data, such as assigning a known class to previously unseen data.

Deep Learning
(2011–2020s)
Deep learning is a subset of machine learning that mimics the structure of the human brain to solve both supervised and unsupervised tasks, using multiple layers of artificial neural networks to make progressively more abstract and higher-level decisions.

How to select ML model based on the problem

A comprehensive guide to choosing the right machine learning model for your problem, from image generation to natural language understanding.

  • 1. Image Generation: Diffusion Models
  • 2. Chatbot: LLMs (e.g., GPT-4)
  • 3. Image Recognition: YOLO
  • 4. Time Series Prediction: Prophet
  • 5. Object Detection: YOLO
  • 6. Speech Recognition: Transformer Models (e.g., Wav2Vec)
  • 7. Sentiment Analysis: RoBERTa
  • 8. Text Summarization: BART, T5
  • 9. Anomaly Detection: Isolation Forest, Autoencoders
  • 10. Recommendation Systems: Matrix Factorization (e.g., ALS)
  • 11. Machine Translation: Transformer Models (e.g., MarianMT)
  • 12. Text Classification: BERT
  • 13. Clustering: DBSCAN
  • 14. Dimensionality Reduction: UMAP
  • 15. Regression Analysis: Gradient Boosting Machines (e.g., XGBoost)
  • 16. Classification: Random Forest
  • 17. Speech Synthesis: WaveNet
  • 18. Natural Language Understanding (NLU): Transformer Models (e.g., GPT, BERT)
  • 19. Style Transfer: Neural Style Transfer
  • 20. Object Tracking: Siamese Networks

Machine Learning tooling

🏆 A ranked list of awesome machine learning Python libraries.

Top-10 study list

My top-10 study list to learn Machine Learning:

  1. Python
  2. Intro Data Science
  3. Intro Machine Learning
  4. Version Control
  5. Intro Algos & Data Structures
  6. Intro Linear Algebra
  7. Intro Calculus
  8. Deep Learning
  9. Intro Probability & Statistics
  10. Parallel Computing

10 steps of Machine Learning

  1. Data collection
  2. Data prep / cleaning
  3. EDA
  4. Feature Engineering (if needed)
  5. Modeling (selection, CV)
  6. Offline evaluation
  7. Integration test
  8. Shadow, A/B
  9. Monitor, log, maintain
  10. Retrain

Hyperparameters

Machine Learning hyperparameters are crucial for optimizing model performance.

Machine Learning hyperparameters

  1. Linear Regression: Focuses on the regularization parameter (alpha) to prevent overfitting.
  2. Logistic Regression: Uses the C parameter for regularization strength, with penalties like L1 and L2.
  3. Decision Tree: Key hyperparameters include max_depth and min_samples_split, which control tree complexity.
  4. K-Nearest Neighbors: Important parameters are n_neighbors and distance metrics to influence model decisions.
  5. Support Vector Machines: Utilizes C, kernel type, gamma, and degree for tuning model capability.
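To make "tuning a hyperparameter" concrete, here is a minimal sketch of a grid search over the regularization strength `alpha` from item 1. It assumes a hand-written one-feature ridge regression without an intercept, and the training data, validation data, and grid values are all made up for illustration.

```python
# Hypothetical noisy training data (true relationship: y = 2x) and clean validation data.
train = [(1.0, 2.6), (2.0, 4.1), (3.0, 6.9)]
valid = [(1.5, 3.0), (2.5, 5.0)]

def fit_ridge(data, alpha):
    """One-feature ridge without intercept: w = sum(x*y) / (sum(x^2) + alpha)."""
    return sum(x * y for x, y in data) / (sum(x * x for x, _ in data) + alpha)

def mse(w, data):
    """Mean squared error of the model y = w * x on `data`."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Grid search: fit once per candidate alpha, keep the one with the lowest validation error.
grid = [0.0, 0.1, 1.0, 10.0]
best_alpha = min(grid, key=lambda alpha: mse(fit_ridge(train, alpha), valid))

for alpha in grid:
    w = fit_ridge(train, alpha)
    print(f"alpha={alpha:<5} w={w:.3f} validation_mse={mse(w, valid):.4f}")
print("best alpha:", best_alpha)
```

The key design point is that `alpha` is chosen on data the model was not fitted on. Picking it by training error would always favor `alpha = 0`.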

More complete list:

Machine Learning hyperparameters


Choose the right model

Choose the right Machine Learning model


Machine Learning types

The 3 types of machine learning (that every data scientist should know). Here's 3 months of research in 3 minutes. Let's go! ☺️

  1. The 3 Fundamental Types of Machine Learning: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Let's break them down.
  2. Supervised Learning: Supervised Learning maps a set of inputs (features) to an output (target). There are 2 types: Classification and Regression.
  3. Classification: Identifying the category that something belongs to. Often I use Binary Classification for lead scoring to get a class probability (the probability from 0 to 1 of how likely the rowwise observation belongs to a class). Think non-buyer or buyer. 0 or 1. Binary Classification.
  4. Regression: Predicting a continuous value. I commonly use Regression for predicting future values of sales demand. It's a special type of regression called Forecasting.
  5. Unsupervised Learning: Learning from unlabelled data. 2 main types I use are clustering and dimensionality reduction. K-means is the most common clustering algorithm I use, often for clustering customers based on their similarities. I use PCA to reduce the number of columns so other supervised machine learning algorithms run more efficiently and to visualize clusters.
  6. Reinforcement Learning: The idea is that the software (an agent) learns which actions to take by trial and error, based on the reward it accumulates. It is one branch of "AI" (Artificial Intelligence), often used when a system has to learn a sequence of decisions rather than a single prediction.
  7. Learning Roadmap (based on real life): I highly recommend learning how to apply concepts 1 to 5 to business applications. I use these all day every day. I have never used number 6, Reinforcement Learning, but it's a powerful concept. Down the road I may take a stab at it and report back. Skip it until I know more (unless there is a specific application you need it for).
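Item 5 mentions K-means for grouping customers by similarity. A minimal sketch of the underlying loop (Lloyd's algorithm) follows; the customer points and the choice of starting centroids are hypothetical, and real libraries add smarter initialization and convergence checks.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers described by (monthly spend, monthly visits), no labels.
customers = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, clusters = kmeans(customers, centroids=[customers[0], customers[-1]])
print(centroids)
print(clusters)
```

No target column is involved anywhere, which is what makes this unsupervised: the groups come only from distances between the rows.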

Frequently used algorithms for biomedical research

Group | Algorithm | Example Usage (Data Type)
Machine Learning (SL) | SVM | Cancer vs healthy classification (gene expression)
Machine Learning (SL) | KNN | Multiclass tissue classification (gene expression)
Machine Learning (SL) | Regression | Genome-wide association analysis (SNP)
Machine Learning (SL) | Random forest | Pathway-based classification (gene expression, SNP)
Deep Learning (SL) | CNN | Protein secondary structure prediction (amino acid sequence)
Deep Learning (SL) | RNN | Sequence similarity prediction (nucleotide sequence)
Clustering (UL) | Hierarchical | Protein family clustering (amino acid sequence)
Clustering (UL) | K-means | Clustering genes by chromosomes (gene expression)
Dimensionality Reduction (UL) | PCA | Classification of outliers (gene expression)
Dimensionality Reduction (UL) | tSNE | Data visualization (single cell RNA-sequencing)
Dimensionality Reduction (UL) | NMF | Clustering gene expression profiles (gene expression)

Type of learning

Supervised Learning (SL):
  • Classification
  • Regression
  • Support Vector Machines
  • Linear Regression
  • GLM
  • Discriminant Analysis
  • SVR
  • GPR
  • Naive Bayes
  • Ensemble Methods
  • Nearest Neighbor
  • Decision Trees
  • Neural Networks

Unsupervised Learning (UL):
  • Clustering
  • K-means
  • K-Medoids
  • Fuzzy C-Means
  • Hierarchical
  • Gaussian Mixture
  • Neural Networks
  • Hidden Markov Model

Machine Learning



Machine Learning Ops

Design | Model Development | Operations
Requirements engineering | Data Engineering | ML Model Deployment
ML Use-Cases prioritization | ML Model Engineering | CI/CD Pipelines
Data availability check | Model testing & validation | Monitoring & Triggering

Machine Learning operations


Machine Learning Tools

Machine Learning languages: Python, R, C++, Java, Prolog, Lisp, Lush.
Data Analysis & Visualisation tools: Pandas, Matplotlib, Jupyter Notebook, Weka, Tableau.
Big Data tools: MemSQL, Apache Spark.
Machine Learning platforms & frameworks: NumPy, Scikit-learn, NLTK, Azure ML, Apache Mahout, KNIME, Weka, Amazon ML, RapidMiner, Colab, TensorFlow, Keras, PyTorch, Shogun.
Machine Learning frameworks for neural network modelling: PyTorch, Keras, Caffe2, TensorFlow & TensorBoard.
Maths for Machine Learning: Linear Algebra, Statistics, Geometry, Calculus, Probability, Regression.

TensorFlow isn't a high-level visualization library. Plotly, Seaborn and Matplotlib are.

Machine Learning Algorithms and Frameworks

1. Supervised Learning

Classification

  • K-Nearest Neighbors (KNN)
  • Logistic Regression
  • Naive Bayes
  • Decision Trees
  • Support Vector Machines (SVM)
  • Random Forest
  • Gradient Boosting Machines (GBM)
  • Neural Networks (MLP, CNN)

Regression

  • Linear Regression
  • Polynomial Regression
  • Ridge Regression
  • Lasso Regression
  • Elastic Net
  • Support Vector Regression (SVR)
  • Decision Trees
  • Random Forest
  • Gradient Boosting
2. Unsupervised Learning

Clustering

  • K-Means
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  • Hierarchical Clustering
  • Gaussian Mixture Models (GMM)

Dimensionality Reduction

  • Principal Component Analysis (PCA)
  • Singular Value Decomposition (SVD)
  • Independent Component Analysis (ICA)
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)
  • Linear Discriminant Analysis (LDA; it uses class labels, so strictly it is a supervised method)

Association

  • Apriori Algorithm
  • ECLAT Algorithm
  • FP-Growth Algorithm
3. Ensemble Learning

Bagging

  • Random Forest
  • Bootstrap Aggregating (Bagging)

Stacking

  • Stacked Generalization
  • Blending

Boosting

  • AdaBoost
  • Gradient Boosting Machines (GBM)
  • XGBoost
  • LightGBM
  • CatBoost
4. Neural Networks

Feedforward Neural Networks

  • Multilayer Perceptron (MLP)
  • Convolutional Neural Networks (CNN)

Recurrent Neural Networks

  • Long Short-Term Memory Networks (LSTM)
  • Gated Recurrent Units (GRU)

Generative Models

  • Generative Adversarial Networks (GAN)
  • Variational Autoencoders (VAE)

Specialized Networks

  • Transformer Networks
  • Autoencoders
  • Radial Basis Function Networks (RBFN)
5. Reinforcement Learning

Value-Based

  • Q-Learning
  • Deep Q-Network (DQN)

Policy-Based

  • REINFORCE Algorithm
  • Proximal Policy Optimization (PPO)

Model-Based

  • AlphaZero
  • Dyna-Q

Other Algorithms

  • Actor-Critic Methods (A3C, A2C)
  • Deep Deterministic Policy Gradient (DDPG)
  • Twin Delayed Deep Deterministic Policy Gradient (TD3)
  • Soft Actor-Critic (SAC)
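Of the reinforcement learning methods listed above, tabular Q-Learning is the simplest to show in code. The sketch below assumes a made-up five-state corridor with a reward at the right end, and it sweeps over every state-action pair instead of exploring randomly so that the result is deterministic; the learning rate and discount factor are arbitrary choices.

```python
# Hypothetical 5-state corridor: states 0..4, reward 1.0 for reaching state 4.
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # move left, move right

def step(state, action):
    """Deterministic environment: returns (next_state, reward)."""
    nxt = min(max(state + action, 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def q_learning(alpha=0.5, gamma=0.9, sweeps=50):
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        # Visiting every non-terminal (state, action) pair replaces random
        # exploration, which keeps this sketch deterministic.
        for s in range(GOAL):
            for a in ACTIONS:
                nxt, reward = step(s, a)
                future = 0.0 if nxt == GOAL else max(q[(nxt, b)] for b in ACTIONS)
                # Q-learning update: move Q(s, a) toward reward + gamma * max Q(s', .)
                q[(s, a)] += alpha * (reward + gamma * future - q[(s, a)])
    return q

q = q_learning()
# Greedy policy: in each state pick the action with the highest learned value.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(policy)
```

The learned values fall off by the discount factor with each step away from the goal, which is how the reward at the end of the corridor shapes the choice made at the start.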

Overfitting Business case

  1. Overfitting is a common issue in machine learning and statistical modeling. It occurs when a model is too complex and captures not only the underlying pattern in the data but also the noise.
  2. Key Characteristics of Overfitting: High Performance on Training Data, Poor Performance on Test Data, Overly Complex with many parameters, Sensitive to minor fluctuations in training data (not robust).
  3. How to Avoid Overfitting (and Underfitting): The goal is to get a model trained to the point where it's robust (not overly sensitive) and generalizes well to new data (unseen during model training). We do this by balancing the bias-variance tradeoff. Common techniques: K-Fold Cross Validation, Regularization (penalizing large coefficients), and simplifying the model.
  4. How I learned about overfitting (business case): I was making a forecast model using linear regression. The model had dozens of features: lags, external regressors, economic features, calendar features… You name it, I included it. And the model did well (on the training data). The problem came when I put my first forecast model into production…
  5. Lack of Stability (is a nice way to put it): My model went out of whack. The linear regression predicted demand for certain products at 100X their recent trends. Luckily, the demand planner called me out on it before the purchase orders went into effect.
  6. I learned a lot from this: Linear regression models can be highly sensitive. I switched to penalized regression (elastic net) and the model became much more stable. Luckily my organization knew I was onto something, and I was given more chances to improve.
  7. The end result: We actually called the end of the Oil Recession of 2016 with my model, and workforce planning was ready to meet the increased demand. This saved us 3 months of inventory time and gave us a competitive advantage when orders began ramping up.

Estimated savings: 10% of sales x 3 months = $6,000,000. Pretty shocking what a couple data science skills can do for a business.
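The instability described in the business case can be reproduced on a tiny scale. This sketch is not the author's model: it assumes two made-up, nearly identical features and swaps in ridge (L2-only) regression for elastic net, because ridge has a closed form that fits in a few lines. Elastic net adds an L1 term on top of the same idea.

```python
import math

def fit_linear(rows, targets, alpha=0.0):
    """Solve (X^T X + alpha * I) w = X^T y for two features, no intercept."""
    a = sum(x1 * x1 for x1, _ in rows) + alpha
    b = sum(x1 * x2 for x1, x2 in rows)
    d = sum(x2 * x2 for _, x2 in rows) + alpha
    p = sum(x1 * y for (x1, _), y in zip(rows, targets))
    q = sum(x2 * y for (_, x2), y in zip(rows, targets))
    det = a * d - b * b  # close to zero when the two features are nearly collinear
    return ((d * p - b * q) / det, (a * q - b * p) / det)

# Hypothetical data: the second feature is almost a copy of the first,
# and the target is roughly their sum.
rows = [(1.0, 1.01), (2.0, 1.98), (3.0, 3.02), (4.0, 3.99)]
targets = [2.1, 3.9, 6.2, 7.8]

ols = fit_linear(rows, targets)               # plain least squares
ridge = fit_linear(rows, targets, alpha=1.0)  # penalized (alpha is an arbitrary choice)

print("OLS weights:  ", ols)
print("Ridge weights:", ridge)
```

On this data the unpenalized weights come out large and of opposite sign, while the penalized weights stay close to 1 each. With weights like the former, a small shift in either input can swing the forecast a long way, which matches the kind of runaway prediction described in item 5.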

Overfitting


MLOps

This repo covers everything you need to know about MLOps.

The goal of the series is to understand the basics of MLOps, such as model building, monitoring, configurations, testing, packaging, deployment, and CI/CD.

  • Week 0: Project Setup
  • Week 1: Model Monitoring - Weights and Biases
  • Week 2: Configurations - Hydra
  • Week 3: Data Version Control - DVC
  • Week 4: Model Packaging - ONNX
  • Week 5: Model Packaging - Docker
  • Week 6: CI/CD - GitHub Actions
  • Week 7: Container Registry - AWS ECR
  • Week 8: Serverless Deployment - AWS Lambda
  • Week 9: Prediction Monitoring - Kibana

ML Ops


Machine learning glossary

Term | Meaning
association | The extent to which values of one field depend on or are predicted by values of another field.
bagging | A modeling technique that is designed to enhance the stability of the model and avoid overfitting. See also boosting, overfitting.
batch scoring | Running the model predictions offline (asynchronously) on a large dataset.
Bayesian network | A graphical model that displays variables in a data set and the probabilistic or conditional independencies between them.
binomial logistic regression | A logistic regression that is used for targets with two discrete categories. See also multinomial logistic regression, target.
boosting | A modeling technique that creates a sequence of models, rather than a single model, to obtain more accurate predictions. Cases are classified by applying the whole set of models to them, and then combining the separate predictions into one overall prediction. See also bagging.
classification and regression tree algorithm | A decision tree algorithm that uses recursive partitioning to split the training records into segments by minimizing the impurity at each step. See also Quick, Unbiased, Efficient Statistical Tree algorithm.
confidence score | An estimate of the accuracy of a prediction, usually expressed as a number from 0.0 to 1.0.
correlation | A statistical measure of the association between two numeric fields. Values range from -1 to +1. A correlation of 0 means that there is no linear relationship between the two fields.
Cox regression algorithm | An algorithm that produces a survival function that predicts the probability that the event of interest has occurred at a given time for given values of the predictor variables.
cross-validation | A technique for testing how well a model generalizes in the absence of a holdout test sample. Cross-validation divides the training data into a number of subsets, and then builds the same number of models, with each subset held out in turn. Each of those models is tested on the holdout sample, and the average accuracy of the models on those holdout samples is used to estimate the accuracy of the model when applied to new data. See also overfitting.
data quality | The extent to which data has been accurately coded and stored. Factors that adversely affect data quality include missing values, data entry errors, measurement errors, and coding inconsistencies.
data set | A collection of data, usually in the form of rows (records) and columns (fields) and contained in a file or database table.
data visualization | The process of presenting data patterns in graphical format, including the use of traditional plots as well as advanced interactive graphics. In many cases, visualization reveals patterns that would be difficult to find using other methods.
decision list | An algorithm that identifies subgroups or segments that show a higher or lower likelihood of a given binary (yes/no) outcome relative to the overall population.
decision tree algorithm | An algorithm that classifies data, or predicts future outcomes, based on a set of decision rules.
deployment | The process of enabling the widespread use of a predictive analytics project within an organization.
evaluate | The process of determining whether a model will accurately predict the target on new and future data.
heat map | A graphical representation of data values in a two-dimensional table format, in which higher values are represented by darker colors and lower values by lighter ones.
histogram | A graphical display of the distribution of values for a numeric field, in the form of a vertical bar chart in which taller bars indicate value ranges that contain more records.
linear regression | A statistical technique for estimating a linear model for a continuous (numeric) output field. Linear models predict a continuous target based on linear relationships between the target and one or more predictors. See also regression.
linear regression model | A modeling algorithm that assumes that the relationship between the input and the output for the model is of a particular, simple form. The model fits the best line through linear regression and generates a linear mapping between the input variables and each output variable.
logistic regression | A statistical technique for classifying records based on the values of the input fields. Logistic regression is similar to linear regression, but takes a categorical target field instead of a numeric one. See also regression.
misclassification cost | A specification of the relative importance of different kinds of classification errors, such as classifying a high-risk credit applicant as low risk. Costs are specified in the form of weights applied to specific incorrect predictions.
model building | The process of creating data models by using algorithms. Model building typically consists of several stages: training, testing and (optionally) validation. See also testing, training, validation.
multinomial logistic regression | A logistic regression that is used for targets with more than two categories. See also binomial logistic regression, target.
neural network | A mathematical model for predicting or classifying cases by using a complex mathematical scheme that simulates an abstract version of brain cells. A neural network is trained by presenting it with a large number of observed cases, one at a time, and allowing it to update itself repeatedly until it learns the task.
online scoring | Applying model predictions in real time to a single record through a published endpoint, within or outside the organization, with a fast response expected (on the order of milliseconds).
overfitting | The unintentional modeling of chance variations in data, leading to models that do not work well when applied to other data sets. Bagging and cross-validation are two methods for detecting or preventing overfitting. See also bagging, cross-validation.
partition | To divide a data set into separate subsets or samples for the training, testing, and validation stages of model building.
predictive analytics | A business process and a set of related technologies that are concerned with the prediction of future possibilities and trends. Predictive analytics applies such diverse disciplines as probability, statistics, machine learning, and artificial intelligence to business problems to find the best action for a given situation.
Predictive Model Markup Language (PMML) | An XML-based language defined by the Data Mining Group that provides a way for companies to define predictive models and share models between compliant vendors' applications.
probability | A measure of the likelihood that an event will occur. Probability values range from 0 to 1; 0 implies that the event never occurs, and 1 implies that the event always occurs. A probability of 0.5 indicates that the event has an even chance of occurring or not occurring.
Quick, Unbiased, Efficient Statistical Tree algorithm (QUEST) | A decision tree algorithm that provides a binary classification method for building the tree. The algorithm is designed to reduce the processing time required for large C & R tree analyses while also reducing the tendency found in classification tree methods to favor inputs that allow more splits. See also classification and regression tree algorithm, decision tree algorithm.
regression | A statistical technique for estimating the value of a target field based on the values of one or more input fields. See also linear regression, logistic regression.
regression tree algorithm | A tree-based algorithm that splits a sample of cases repeatedly to derive homogeneous subsets, based on values of a numeric output field. See also Chi-squared Automatic Interaction Detector algorithm.
score | To apply a predictive model to a data set with the intention of producing a classification or prediction for a new, untested case.
script | A series of commands, combined in a file, that carry out a particular function when the file is run. Scripts are interpreted as they are run.
testing | The stage of model building in which the model produced by the training stage is tested against a data subset for which the outcome is already known. See also model building, training, validation.
training | The initial stage of model building, involving a subset of the source data. The model can then be tested against a further, different subset for which the outcome is already known. See also model building, testing, validation.
transformation | A formula that is applied to the values of a field to alter the distribution of values. Some statistical methods require that fields have a particular distribution. When a field's distribution differs from what is required, a transformation (such as taking logarithms of values) can often remedy the problem.
unrefined model | A model that contains information extracted from the data but which is not designed for generating predictions directly.
validation | An optional final stage of model building in which the refined model from the testing stage is validated against a further subset of the source data. See also model building, testing, training.
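The cross-validation entry in the glossary above describes a concrete procedure, sketched here in plain Python. The helper names (`k_fold_indices`, `cv_mean_score`), the mean-only "model", and the use of mean squared error instead of accuracy are all assumptions made for the example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is held out exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def cv_mean_score(xs, ys, fit, score, k=5):
    """Build k models, score each on its held-out fold, and average the scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Hypothetical model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mean_squared_error(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
xs = list(range(len(ys)))
estimate = cv_mean_score(xs, ys, fit_mean, mean_squared_error, k=3)
print(estimate)
```

Because the folds here are contiguous slices, ordered data should be shuffled first in practice; the sketch leaves that out to stay deterministic.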

Roadmap

Mathematics: Probability | Statistics | Discrete
Programming: Python | R | Java
Database: MySQL | MongoDB
Machine Learning: Scikit-learn | Supervised learning | Unsupervised learning | Reinforcement learning
Machine Learning: ML Libraries and Non-ML Libraries
ML Algorithms: Linear | Logistic Regression | KNN | K-means | Random forest & more!
Deep Learning: TensorFlow, Keras | Neural Networks | CNN, RNN, GAN, LSTMs
Data Visualization Tools: Tableau | QlikView | Power BI
ML Engineer

Machine Learning roadmap


11 Machine Learning techniques

Machine Learning techniques


11 Machine Learning methods

Machine Learning methods


e-Books


Blogs

If you're serious about growing in AI/ML, these are top 12 blogs worth reading in 2025 👇

These authors build production LLM systems, ship AI features to millions of users, and share insights you won't find anywhere else:

  • 1) Andrej Karpathy (ex Tesla AI Director & OpenAI co-founder)
    Neural networks and LLMs explained from first principles by one of the OGs of modern AI.
    🔗 Visit Blog

  • 2) Sebastian Raschka, PhD
    Deep dives into LLM training and fine-tuning with super clear code examples.
    🔗 Visit Blog

  • 3) Interconnects by Nathan Lambert
    AI alignment, open-source models, and ecosystem news.
    🔗 Visit Blog

  • 4) Lil'Log by Lilian Weng (ex VP of Research at OpenAI)
    Lessons from someone who worked on practical AI safety and alignment at OpenAI.
    🔗 Visit Blog

  • 5) Chip Huyen
    Real-world MLOps and production ML systems design patterns.
    🔗 Visit Blog

  • 6) Eugene Yan (Principal Applied Scientist at Amazon)
    Great writing on applied ML, data science, and working with recommender systems in production.
    🔗 Visit Blog

  • 7) Philipp Schmid (Senior AI Relation Engineer at Google DeepMind, ex Hugging Face)
    Tutorials on building and deploying LLM apps on AWS.
    🔗 Visit Blog

  • 8) Jason Liu
    Learn from a consultant sharing real lessons on LLMs, data, and open-source tools.
    🔗 Visit Blog

  • 9) Hamel H. (ex GitHub Staff ML Engineer)
    MLOps workflows, fine-tuning, and product strategy from an ML veteran.
    🔗 Visit Blog

  • 10) Berkeley Artificial Intelligence Research Blog
    Latest academic breakthroughs in computer vision, NLP, and robotics.
    🔗 Visit Blog

  • 11) Hugging Face
    Product updates, tutorials, and the latest from open-source AI.
    🔗 Visit Blog

  • 12) Google DeepMind
    Google's premier AI research division.
    🔗 Visit Blog