🌟AI/ML Dictionary
While learning all these new terms, I kept needing to look things up. This dictionary grew out of those lookups.
Data: Machine learning algorithms rely on data to learn patterns and make predictions. The data can be structured (e.g., databases, spreadsheets) or unstructured (e.g., text, images). It is crucial to have a sufficient and representative dataset for effective learning.
Training: During the training phase, the machine learning model learns patterns and relationships in the data. The model is fed with labeled examples (known as the training set) consisting of input data and corresponding desired outputs or labels. Note that training heavily biases a model toward specific outputs.
Fine-tuning: Fine-tuning is an extension of training that is often referred to in the context of LLMs. This form of training typically takes a model after more generalized training and heavily biases it towards a specific type of completion, such as instruction following, summarization, translation, or chat assistance. There are advanced methods (e.g., LoRA and other parameter-efficient fine-tuning (PEFT) techniques) that help prevent or greatly reduce catastrophic forgetting.
Features: Features are specific measurable properties or characteristics of the data that are used as inputs to the machine learning model. Choosing relevant and informative features is important for the model to learn effectively.
Supervised Learning: In supervised learning, the training data includes both input data and corresponding labels. The goal is to learn a mapping between inputs and outputs, enabling the model to make accurate predictions on new, unseen data.
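A minimal sketch of the supervised workflow in Python using scikit-learn (the dataset and model choice are illustrative assumptions, not prescriptions):

```python
# Supervised learning: fit a model on labeled examples, then evaluate on unseen data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                       # inputs and their labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)               # learn a mapping from inputs to labels
model.fit(X_train, y_train)                             # training phase
print(model.score(X_test, y_test))                      # accuracy on data the model never saw
```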
Unsupervised Learning: In unsupervised learning, the training data consists only of input data without any corresponding labels. The goal is to discover patterns, relationships, or structures in the data, such as clustering similar data points or dimensionality reduction.
Model: The term model can refer to the software that defines/implements the model architecture, the weights/biases, or both the architecture and the weights/biases combined. Model weights/biases can range in size from a single value for the simplest models up to multiple terabytes for the largest LLMs. For modern LLMs, the software that defines/implements the model architecture is usually orders of magnitude smaller than the weights/biases themselves.
Foundation Model: Foundation models are large-scale machine learning models trained on a wide variety of internet text, serving as a base or "foundation" for numerous downstream tasks. These models, such as GPT-3 or GPT-4, exhibit general applicability across diverse tasks, even those unseen during training. They also demonstrate few-shot learning, or the ability to understand a task from minimal examples. However, they pose challenges and ethical concerns, including harmful biases and misuse potential, as they often reproduce the biases present in their training data. Efforts are underway to mitigate these issues and enhance their safety and usefulness.
Model Evaluation: After training, the model's performance is evaluated using a separate dataset called the validation or test set. Evaluation metrics, such as accuracy, precision, recall, or mean squared error, are used to assess how well the model generalizes to new, unseen data.
Prediction/Inference: Once trained, the machine learning model can be used to make predictions or decisions on new, unseen data. The model takes the input data and produces an output or prediction based on what it has learned during training.
Overfitting and Underfitting: Overfitting occurs when a model performs well on the training data but fails to generalize to new data. Underfitting happens when a model is too simple to capture the underlying patterns in the data. Balancing between these two is essential to build a well-performing model.
Catastrophic Forgetting: Catastrophic forgetting can be considered similar to overfitting; however, it is typically referenced in the context of LLMs that lose entire subject areas of knowledge or behavioral capabilities when they are fine-tuned or retrained for a specific task.
Feature Engineering: Feature engineering involves selecting, transforming, and creating features from the raw data to improve the model's performance. It requires domain knowledge and creativity to extract meaningful information that helps the model learn effectively.
Deep Learning: Deep learning is a subset of machine learning that focuses on training artificial neural networks with multiple layers (deep neural networks). Deep learning models have shown remarkable success in various domains, including computer vision, natural language processing, and speech recognition.
Supervised Learning Algorithms: Supervised learning algorithms include popular methods like linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and naive Bayes. These algorithms learn from labeled training data to make predictions or classify new data.
Unsupervised Learning Algorithms: Unsupervised learning algorithms focus on finding patterns, structures, or relationships in the data without using labeled examples. Clustering algorithms, such as k-means clustering and hierarchical clustering, group similar data points together. Dimensionality reduction techniques, such as principal component analysis (PCA) and t-SNE, reduce the number of features while retaining important information.
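A quick sketch of both ideas with scikit-learn; the synthetic data and parameter choices are assumptions for illustration only:

```python
# Unsupervised learning: k-means groups similar points, PCA reduces dimensionality.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                          # unlabeled data: 200 samples, 10 features

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # cluster assignments
X_2d = PCA(n_components=2).fit_transform(X)                                 # 10 features -> 2
print(clusters[:10], X_2d.shape)
```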
Reinforcement Learning: Reinforcement learning is a type of machine learning where an agent learns to interact with an environment to maximize a reward signal. The agent takes actions in the environment and receives feedback in the form of rewards or penalties. It learns through trial and error, aiming to discover an optimal policy for decision-making.
Neural Networks: Neural networks are a class of machine learning models inspired by the structure and function of the human brain. They consist of interconnected nodes called artificial neurons. Deep neural networks (DNNs) are neural networks with multiple hidden layers, enabling them to learn hierarchical representations of data. Convolutional Neural Networks (CNNs) are commonly used for image and video analysis, while Recurrent Neural Networks (RNNs) are well-suited for sequential data, such as text or time series.
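A tiny feed-forward network sketch in PyTorch (layer sizes are arbitrary assumptions), just to make "multiple hidden layers" concrete:

```python
# A small "deep" network: two hidden layers between input and output.
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),   # hidden layer 1
    nn.Linear(64, 64), nn.ReLU(),   # hidden layer 2
    nn.Linear(64, 3),               # output layer, e.g. scores for 3 classes
)
```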
Feature Selection: Feature selection aims to identify the most relevant and informative features from a dataset. It helps to reduce dimensionality, improve model performance, and prevent overfitting. Techniques like forward selection, backward elimination, and regularization methods (e.g., Lasso or Ridge regression) can be used for feature selection.
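One possible sketch of model-based feature selection with scikit-learn, using an L1-regularized (Lasso) model; the synthetic data and alpha value are illustrative assumptions:

```python
# Keep only the features the Lasso model assigns non-zero weight to.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, random_state=0)
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
print(selector.get_support().sum(), "features kept out of", X.shape[1])
```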
Cross-Validation: Cross-validation is a technique used to assess the performance of a machine learning model. It involves splitting the dataset into multiple subsets (folds). The model is trained on a portion of the data and evaluated on the remaining fold. This process is repeated several times, and the results are averaged to obtain a more reliable estimate of the model's performance.
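A k-fold cross-validation sketch with scikit-learn (the 5 folds and the decision tree are assumptions):

```python
# Train/evaluate across 5 folds and average the scores for a more reliable estimate.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())
```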
Hyperparameter Tuning: Hyperparameters are parameters that define the behavior and performance of machine learning algorithms. Hyperparameter tuning involves selecting the optimal combination of hyperparameter values to improve the model's performance. Techniques like grid search, random search, and Bayesian optimization can be used for hyperparameter tuning.
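A grid-search sketch with scikit-learn; the SVM and the grid values are arbitrary examples:

```python
# Try every combination of hyperparameters in the grid and keep the best one.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```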
Transfer Learning: Transfer learning is a technique where knowledge learned from one task or domain is applied to another related task or domain. Instead of training a model from scratch, pre-trained models (usually trained on large datasets) can be used as a starting point. This approach saves computational resources and helps in situations with limited training data.
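One common transfer-learning recipe sketched with PyTorch/torchvision (assumes a recent torchvision; the 5-class head is hypothetical): freeze the pre-trained backbone and train only a new output layer.

```python
# Reuse an ImageNet-pretrained ResNet as a feature extractor for a new task.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pre-trained weights
for param in model.parameters():
    param.requires_grad = False                      # freeze the pre-trained layers
model.fc = nn.Linear(model.fc.in_features, 5)        # new head for a hypothetical 5-class task
# ...then train only model.fc on the smaller, task-specific dataset.
```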
Ensemble Learning: Ensemble learning combines multiple individual models to make predictions or decisions. It aims to improve the overall performance and robustness of the model. Techniques like bagging (e.g., random forests), boosting (e.g., AdaBoost, Gradient Boosting), and stacking are commonly used in ensemble learning.
Bias: Bias is an overloaded term within AI/ML and is not inherently negative in its use. When discussing model bias in the context of inputs and outputs, bias is exactly what you are trying to achieve when training a model to respond in a specific way: by swaying the model toward a specific output for a specific input, you are biasing the output. Understand that this term is used in the normal context of AI/ML but also in the context of fairness and toxicity; always ensure you have appropriate context to interpret the meaning.
Bias and Fairness: Machine learning models can inadvertently inherit biases from the training data, leading to unfair or discriminatory outcomes. Bias and fairness considerations are essential to ensure that machine learning models treat individuals fairly and equitably across different groups. Techniques like bias detection, bias mitigation, and fairness-aware learning are actively researched areas.
Data Preprocessing: Data preprocessing is a crucial step in machine learning. It involves cleaning, transforming, and normalizing the data to improve the quality and suitability for the learning algorithm. Common preprocessing techniques include handling missing values, dealing with outliers, scaling features, and encoding categorical variables.
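A preprocessing sketch with pandas and scikit-learn; the column names and strategies are hypothetical:

```python
# Impute missing values, scale numeric columns, and one-hot encode a categorical column.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [25, None, 40],
                   "income": [50_000, 62_000, None],
                   "city": ["NY", "SF", "NY"]})

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
print(preprocess.fit_transform(df))
```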
Imbalanced Data: Imbalanced data refers to datasets where the number of examples in different classes is significantly imbalanced. For instance, in fraud detection or rare disease prediction, the positive class (e.g., fraud cases or rare diseases) may be underrepresented. Handling imbalanced data requires techniques such as oversampling, undersampling, or using specialized algorithms like SMOTE (Synthetic Minority Over-sampling Technique).
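A SMOTE sketch using the imbalanced-learn package (assumed installed); the 95/5 class split is made up:

```python
# Oversample the minority class by synthesizing new examples.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```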
Model Evaluation Metrics: Model evaluation metrics assess the performance of machine learning models. The choice of metrics depends on the specific problem and the type of model. Common evaluation metrics include accuracy, precision, recall, F1 score, area under the ROC curve (AUC-ROC), mean squared error (MSE), mean absolute error (MAE), and many more.
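A few of these metrics computed on toy predictions with scikit-learn:

```python
# Common classification metrics; y_true/y_pred are made-up values for illustration.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
```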
Time Series Analysis: Time series analysis deals with data that is collected over time, such as stock prices, temperature records, or sensor data. Techniques like autoregressive integrated moving average (ARIMA), seasonal decomposition of time series (STL), and recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM) are commonly used for time series forecasting and analysis.
Model Deployment: Once a machine learning model is trained, it needs to be deployed in a production environment for real-world use. This involves integrating the model into an application, creating APIs for model inference, and ensuring scalability, performance, and reliability. Deployment frameworks like TensorFlow Serving, Flask, or Docker containers are often used.
Online Learning: Online learning, also known as incremental learning or streaming learning, refers to a learning approach where the model is continuously updated and adapted to new incoming data. This is particularly useful when dealing with large-scale or streaming datasets that arrive sequentially and need real-time learning and prediction.
Interpretability and Explainability: As machine learning models become more complex, there is a growing need to understand and interpret their decisions. Interpretable machine learning techniques aim to provide insights into the model's decision-making process, making it easier to understand and trust the model's predictions. Techniques like feature importance, SHAP values, and surrogate models are used for model interpretability.
AutoML: AutoML (Automated Machine Learning) refers to the use of automated tools and techniques to automatically select, preprocess, and optimize machine learning models. It aims to streamline and simplify the machine learning pipeline, making it more accessible to users without extensive knowledge of machine learning algorithms.
Adversarial Machine Learning: Adversarial machine learning focuses on studying and defending against adversarial attacks on machine learning models. Adversaries intentionally manipulate or perturb the input data to deceive or mislead the model. Adversarial techniques aim to improve model robustness and security, especially in critical applications like cybersecurity and image recognition.
Ethical Considerations: As machine learning models have a growing impact on society, ethical considerations become essential. Issues like privacy, bias, transparency, and accountability need to be addressed to ensure the responsible development and deployment of machine learning systems.
Natural Language Processing (NLP): Natural Language Processing focuses on enabling computers to understand, interpret, and generate human language. It involves techniques such as text classification, sentiment analysis, named entity recognition, machine translation, question answering, and language generation. NLP plays a crucial role in applications like chatbots, virtual assistants, and language understanding systems.
Computer Vision: Computer Vision is a field that deals with enabling computers to interpret and understand visual information from images or videos. It involves tasks such as object detection, image segmentation, image classification, facial recognition, and image generation. Computer Vision finds applications in autonomous vehicles, surveillance systems, medical imaging, and augmented reality.
Deep Reinforcement Learning: Deep Reinforcement Learning combines deep learning techniques with reinforcement learning. It involves training deep neural networks to learn policies that maximize rewards through interactions with an environment. Deep reinforcement learning has achieved remarkable success in complex tasks like playing games (e.g., AlphaGo and OpenAI Five) and robotic control.
Generative Models: Generative models are machine learning models that can generate new samples similar to the training data. Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) are popular generative models. They have applications in image synthesis, text generation, and data augmentation.
One-shot Learning: One-shot learning focuses on the problem of learning from a single or a few examples per class. It aims to develop models that can generalize and recognize new instances from limited training data. Few-shot learning and zero-shot learning are related settings: few-shot learning uses a small number of examples per class, while zero-shot learning recognizes classes with no labeled examples at all.
Autoencoders: Autoencoders are unsupervised learning models that aim to learn efficient representations or encodings of input data. They consist of an encoder network that compresses the input data into a lower-dimensional representation and a decoder network that reconstructs the input from the encoded representation. Autoencoders are used for tasks like dimensionality reduction, denoising, and anomaly detection.
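A minimal PyTorch autoencoder sketch; the layer and latent sizes are arbitrary assumptions:

```python
# Encoder compresses the input to a small latent vector; decoder reconstructs the input.
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Typically trained with a reconstruction loss, e.g. nn.MSELoss() between forward(x) and x.
```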
Bayesian Learning: Bayesian learning is an approach to machine learning that incorporates prior knowledge or beliefs about the problem. It uses Bayesian inference to update and refine the model's predictions based on the observed data. Bayesian learning helps in situations with limited data and provides uncertainty estimates for predictions.
Neuroevolution: Neuroevolution combines neural networks and evolutionary algorithms. It involves using evolutionary techniques like genetic algorithms or genetic programming to optimize the architecture or parameters of neural networks. Neuroevolution is commonly used in reinforcement learning tasks and neural architecture search.
Privacy-Preserving Machine Learning: Privacy-preserving machine learning techniques aim to protect the privacy of sensitive data while allowing for analysis and model training. Techniques like federated learning, differential privacy, and secure multi-party computation enable collaborative learning without sharing raw data.
Quantum Machine Learning: Quantum machine learning explores the intersection of quantum computing and machine learning. It investigates how quantum algorithms and quantum computers can enhance or accelerate various machine learning tasks, such as optimization, clustering, and feature selection.
Graph Neural Networks: Graph Neural Networks (GNNs) are a class of neural networks designed to process and analyze data represented as graphs. GNNs can capture relationships and dependencies among nodes and edges in complex networks. They find applications in social network analysis, recommendation systems, and molecular chemistry.
Time Series Forecasting: Time series forecasting involves predicting future values or trends based on historical data that is ordered chronologically. Techniques like autoregressive integrated moving average (ARIMA), exponential smoothing methods, recurrent neural networks (RNNs), and Long Short-Term Memory (LSTM) networks are commonly used for time series forecasting.
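An ARIMA sketch using statsmodels (assumed installed); the toy series and the (1, 1, 1) order are arbitrary:

```python
# Fit ARIMA on a toy random-walk series and forecast the next 5 values.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.cumsum(np.random.default_rng(0).normal(size=100))
fit = ARIMA(series, order=(1, 1, 1)).fit()
print(fit.forecast(steps=5))
```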
Anomaly Detection: Anomaly detection focuses on identifying rare or abnormal instances in a dataset that deviate from the norm. It finds applications in fraud detection, network intrusion detection, system monitoring, and outlier detection. Approaches for anomaly detection include statistical methods, clustering, and supervised or unsupervised learning algorithms.
Transfer Learning in NLP: Transfer learning has had significant impact in natural language processing (NLP). Pre-trained language models like BERT, GPT, and RoBERTa, trained on massive amounts of text data, can be fine-tuned for specific NLP tasks with limited labeled data. Transfer learning has revolutionized various NLP tasks, including text classification, named entity recognition, and text generation.
Causal Inference: Causal inference aims to understand cause-and-effect relationships between variables. It involves identifying causal effects from observational or experimental data. Causal inference techniques, such as propensity score matching, instrumental variable analysis, and causal graphical models, help in making informed decisions and understanding the impact of interventions.
Semi-Supervised Learning: Semi-supervised learning combines labeled and unlabeled data to train machine learning models. It leverages the unlabeled data to improve the model's performance and generalization. Semi-supervised learning is useful when acquiring labeled data is expensive or time-consuming, as it can make use of large amounts of readily available unlabeled data.
Domain Adaptation: Domain adaptation focuses on transferring knowledge learned from one domain (source domain) to another domain (target domain). It addresses the challenge of model performance degradation when the training data and test data come from different distributions. Domain adaptation techniques aim to align the source and target domains to improve model performance on the target domain.
Capsule Networks: Capsule Networks (CapsNets) are a type of neural network architecture introduced as an alternative to traditional convolutional neural networks (CNNs). CapsNets aim to capture hierarchical relationships between visual entities and handle spatial relationships more effectively. They show potential in tasks like object recognition, pose estimation, and image synthesis.
Responsible AI: Responsible AI focuses on developing machine learning systems that are fair, transparent, and unbiased, and that mitigate potential risks and ethical concerns. It involves addressing issues such as fairness, accountability, transparency, interpretability, and robustness in machine learning models and systems.
Adaptive Learning: Adaptive learning refers to machine learning systems that can adapt and change their behavior based on user feedback or changing environments. These systems continuously learn from user interactions and update their models to personalize the learning experience or improve performance over time.
Metaheuristic Optimization: Metaheuristic optimization algorithms are used to solve complex optimization problems that cannot be easily solved by traditional optimization techniques. These algorithms, such as genetic algorithms, particle swarm optimization, and simulated annealing, are inspired by natural processes and can efficiently explore large search spaces to find near-optimal solutions.
Multi-modal Learning: Multi-modal learning deals with data that comes from multiple sources or modalities, such as text, images, audio, or sensor data. It aims to integrate and learn from different modalities to improve performance or gain a more comprehensive understanding of the data. Multi-modal learning finds applications in areas like multimedia analysis, healthcare, and autonomous systems.
Meta-learning: Meta-learning, also known as learning to learn, focuses on developing models or algorithms that can learn how to learn new tasks or adapt quickly to new environments. Meta-learning algorithms aim to acquire knowledge and meta-knowledge from previous learning experiences to facilitate the learning of new tasks with limited data. It finds applications in few-shot learning, hyperparameter optimization, and domain adaptation.
Data Augmentation: Data augmentation techniques involve creating new training examples by applying transformations or perturbations to the existing data. These techniques help increase the diversity of the training set, improve model generalization, and reduce overfitting. Common data augmentation methods include rotation, translation, scaling, flipping, and adding noise to the data.
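An image-augmentation sketch with torchvision transforms; the specific transforms and parameters are illustrative assumptions:

```python
# Random flips, rotations, crops, and color jitter applied per sample during training.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])
# augmented = augment(pil_image)   # yields a slightly different example each epoch
```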
Privacy-Preserving Techniques: Privacy-preserving techniques in machine learning aim to protect the privacy of sensitive data during model training or inference. Techniques like secure multi-party computation, federated learning, and homomorphic encryption allow for collaborative model training or prediction while preserving data privacy.
AutoML for Neural Architecture Search: AutoML techniques, including neural architecture search (NAS), automate the process of designing or discovering optimal neural network architectures for a given task. NAS algorithms explore and optimize the architecture space by using reinforcement learning, evolutionary algorithms, or gradient-based methods.
Model Compression: Model compression techniques aim to reduce the size and computational complexity of machine learning models, making them more efficient for deployment on resource-constrained devices or systems. Techniques like pruning, quantization, knowledge distillation, and model factorization help in compressing models while maintaining performance.
Knowledge Distillation: Knowledge distillation is a technique where a smaller, more compact model, known as a student model, is trained to mimic the predictions or behavior of a larger, more complex model, known as a teacher model. This helps transfer knowledge from the teacher model to the student model, enabling the student model to achieve similar performance while being more lightweight.
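A distillation-loss sketch in PyTorch: the student matches the teacher's softened outputs in addition to the true labels. The temperature and mixing weight below are assumptions, not canonical values.

```python
# Blend a soft loss (match teacher probabilities) with a hard loss (match true labels).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                         soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```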
Human-in-the-Loop Machine Learning: Human-in-the-loop machine learning involves integrating human expertise and feedback into the machine learning pipeline. It combines automated learning algorithms with human decision-making to create interactive and iterative learning systems. Human-in-the-loop approaches are used in tasks like active learning, interactive data labeling, and model debugging.
Neurosymbolic AI: Neurosymbolic AI combines the power of neural networks and symbolic reasoning. It aims to integrate deep learning models with symbolic representations and reasoning capabilities, enabling machines to learn from data while also leveraging human-like symbolic reasoning and logic.
Transfer Learning in Computer Vision: Transfer learning has also made significant strides in computer vision tasks. Pre-trained convolutional neural networks (CNNs) such as VGG, ResNet, and Inception are often used as feature extractors or fine-tuned for specific vision tasks, allowing for effective learning with limited labeled data.
Data Bias and Fairness: Data bias and fairness have gained considerable attention in machine learning. It involves addressing biases in training data that may lead to unfair or discriminatory outcomes. Techniques such as bias detection, bias mitigation, and fairness-aware learning aim to ensure equitable and unbiased machine learning models.
AutoML for Hyperparameter Optimization: AutoML techniques extend beyond neural architecture search. They also encompass automating the process of hyperparameter optimization, which involves selecting the best values for hyperparameters that control the learning process. Techniques like Bayesian optimization, genetic algorithms, and random search are employed for automated hyperparameter tuning.
Active Learning: Active learning is a semi-supervised learning approach that involves an iterative process of selecting the most informative or uncertain examples from a pool of unlabeled data and requesting labels from an oracle (e.g., a human expert). This approach helps optimize the learning process by actively selecting the most valuable data points to label, reducing labeling effort.
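An uncertainty-sampling sketch; the model is assumed to be any classifier exposing predict_proba, and the batch size k is arbitrary:

```python
# Pick the pool examples the current model is least confident about and send them to the oracle.
import numpy as np

def most_uncertain(model, X_pool, k=10):
    proba = model.predict_proba(X_pool)      # class probabilities for each unlabeled example
    confidence = proba.max(axis=1)           # confidence in the predicted class
    return np.argsort(confidence)[:k]        # lowest confidence = most informative to label

# e.g. fit on the labeled set, then ask a human to label X_pool[most_uncertain(model, X_pool)]
```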
Meta-reinforcement Learning: Meta-reinforcement learning combines ideas from reinforcement learning and meta-learning. It focuses on training agents that can adapt to new and unseen reinforcement learning tasks with minimal learning or fine-tuning. Meta-reinforcement learning algorithms learn a meta-policy that generalizes across multiple tasks and guides the learning process for new tasks.
Online Recommender Systems: Online recommender systems aim to provide personalized recommendations in real-time, often in online platforms such as e-commerce, streaming services, or social media. These systems leverage techniques such as collaborative filtering, content-based filtering, and reinforcement learning to make timely and relevant recommendations to users.
Model Interpretability and Explainability in Deep Learning: As deep learning models become increasingly complex, interpretability and explainability have become critical. Techniques such as attention mechanisms, gradient-based methods (e.g., Grad-CAM), and rule extraction aim to provide insights into the decision-making process of deep learning models, making them more transparent and understandable.
Weakly Supervised Learning: Weakly supervised learning deals with scenarios where the training data is only partially labeled or labeled at a coarse level. It aims to learn models with limited supervision, leveraging techniques such as multiple instance learning, co-training, or self-supervised learning. Weakly supervised learning is useful when acquiring fine-grained labels is expensive or impractical.
Gaussian Processes: Gaussian Processes (GPs) are probabilistic models that can be used for regression, classification, and uncertainty estimation. GPs provide a flexible framework for modeling data and making predictions, offering non-parametric and Bayesian approaches. They find applications in diverse areas such as surrogate modeling, Bayesian optimization, and time series analysis.
Analogical Reasoning: Analogical reasoning involves solving problems by finding similarities and relationships between different examples or domains. It is inspired by human reasoning processes and has applications in areas such as natural language processing, image recognition, and cognitive modeling.
Meta-learning for Few-shot Learning: Meta-learning approaches are widely used for few-shot learning, where the goal is to learn new concepts or tasks with limited training data. Meta-learning algorithms aim to acquire knowledge from previous tasks to adapt quickly to new tasks with only a few examples per class.
Adversarial Machine Learning: Adversarial machine learning focuses on understanding and defending against adversarial attacks on machine learning models. Adversarial attacks involve intentionally manipulating input data to deceive or mislead the model's predictions. Defenses against adversarial attacks include robust optimization, adversarial training, and input sanitization techniques.
Automated Feature Engineering: Automated feature engineering involves the automatic generation or selection of features from raw data. It leverages techniques such as feature extraction, dimensionality reduction, and feature selection algorithms to transform raw data into meaningful and informative representations for machine learning models.
Graph Representation Learning: Graph representation learning focuses on learning meaningful and informative representations of nodes and edges in graph-structured data. Techniques like graph neural networks (GNNs), graph embedding methods, and random walk-based algorithms enable effective analysis and modeling of complex relational data.
Causal Discovery: Causal discovery aims to identify causal relationships between variables in a dataset. It involves inferring cause-and-effect relationships from observational or experimental data. Causal discovery algorithms help uncover underlying mechanisms and dependencies in complex systems.
Metaheuristic Learning to Optimize Hyperparameters: Metaheuristic algorithms, such as genetic algorithms, particle swarm optimization, or simulated annealing, can be used to optimize hyperparameters of machine learning models. These algorithms explore the hyperparameter space efficiently and find near-optimal combinations of hyperparameters.
Automated Machine Learning (AutoML): AutoML involves automating the process of designing, training, and optimizing machine learning models. It encompasses techniques like neural architecture search, hyperparameter optimization, feature selection, and model selection. AutoML aims to simplify the machine learning workflow and make it accessible to non-experts.
Unsupervised Domain Adaptation: Unsupervised domain adaptation deals with scenarios where labeled data is available in a source domain but not in the target domain. It focuses on learning representations that can generalize well across domains without relying on labeled data from the target domain. Unsupervised domain adaptation techniques bridge the gap between different domains and improve model performance on the target domain.
Machine Learning in Robotics: Machine learning plays a crucial role in robotics, enabling robots to perceive and interact with the environment, learn from data, and make intelligent decisions. Machine learning techniques, such as reinforcement learning, imitation learning, and computer vision, are used for robot control, perception, navigation, and manipulation tasks.
Agent, Agency, or Agentic Behavior: An agent, an LLM with agency, or an LLM exhibiting agentic behavior refers to an LLM that is exposed to tools and uses them through iterative prompting, with each prompt augmented by the results of previous tool use. The LLM decides the path and which tools to use - the system is acting as an independent agent in the truest sense of the word.
Emergent Behaviors: Emergent behaviors are behaviors a model was not specifically trained or fine-tuned for but acquires inherently through its initial training data. There is a strong correlation between the number of parameters a model is built on and many of these emergent behaviors. Some examples of emergent behaviors include the ability to do math, summarization, code completion, joke explanation, logical chain-of-thought reasoning, and many others. Some of these behaviors may be harmless, some may be sought after, and others may be hazardous or exploitable.
Transformer Model: A type of neural network that does away with recurrent neural network (RNN) and convolutional neural network (CNN) concepts and replaces them entirely with attention mechanisms. Transformers are further broken down into encoder models, decoder models, and encoder-decoder models. Training a transformer model typically takes less time, and the resulting model tends to generalize from its training data much better than RNN or CNN models.
Retrieval-Augmented Generation (RAG): RAG is a technique that allows conventional applications to augment a prompt for an LLM or other model. This concept is closely related to data enrichment as discussed by conventional ML researchers and engineers: you are effectively enriching the data, the content of the prompt, with additional information from an outside system. In the case of an LLM based chatbot, RAG may automatically enrich the user’s prompt with a Wikipedia entry based on the subject extracted from the user’s prompt. This gives the LLM more information to work with and helps it remain factual, since the output becomes heavily biased by the retrieved content in the prompt.
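A hypothetical sketch of the idea; retrieve() and the prompt template below are placeholders, not a real library API:

```python
# Enrich the user's prompt with retrieved context before sending it to the LLM.
def retrieve(query: str) -> str:
    """Stand-in for a search over a document store, wiki, or vector database."""
    return "Retrieved passage relevant to: " + query

def build_rag_prompt(user_prompt: str) -> str:
    context = retrieve(user_prompt)
    return ("Use the following context to answer.\n"
            f"Context:\n{context}\n\n"
            f"Question: {user_prompt}\nAnswer:")

print(build_rag_prompt("Who developed the theory of general relativity?"))
```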
Token: The atomic representation of any data element that is fed into a transformer model and some other forms of NLP based machine learning models. Tokens for LLMs can consist of short phrases, words, word parts, letters, numbers, symbols, whitespace, and other non-printable characters. An LLM never sees the text itself, only an integer/float representation of each textual piece; consequently, LLMs can’t spell (at least not very well). Tokens can also take the form of individual pixel information, discrete audio samples, or any other form of digitized information used to train a transformer or other token-based model.
Tokenizer: For LLMs, there are two basic types of tokenizers: greedy tokenizers and NLP tokenizers. The effectiveness of the tokenizer is heavily correlated with how well the model performs at abstracting information from training data. Greedy tokenizers for LLMs just match a list of words, word parts, letters, numbers, symbols, and whitespace against the input text and convert this input text into an array of tokens (integers). NLP based tokenizers are often trained models themselves that break down language and label the constituent parts with context related to the language itself. This additional context often allows smaller models to converge faster in training.
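A toy greedy tokenizer sketch (longest-match against a made-up vocabulary); real LLM tokenizers such as BPE or WordPiece are far more sophisticated:

```python
# Greedily match the longest vocabulary entry at each position and emit its integer id.
VOCAB = {"hello": 1, "hell": 2, "o": 3, " ": 4, "world": 5, "wor": 6, "ld": 7}

def greedy_tokenize(text: str) -> list:
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):            # try the longest match first
            if text[i:j] in VOCAB:
                tokens.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return tokens

print(greedy_tokenize("hello world"))                # -> [1, 4, 5]
```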
Special Tokens: Special tokens are used by almost all LLMs and other transformer based models. These tokens can be used as delimiters, end-of-string, end-of-line, and other similar ideas that may or may not be represented by the data itself as it is fed in. These special tokens may also allow LLMs and other transformer based models to output specific styles in their text or delimit other aspects of data output in the case of language based transformers.
Prompt: Input into an LLM. Prompts can be instructions, text for completion, and other textual information. There are subcategories of prompting such as system prompting or pre-prompting, RAG prompting, and user prompting. These are often delimited through textual delimiters or special tokens. Prompting is used to bias an LLM towards a specific output, output style, or behavior.
Prompt Engineering: Heavily biasing an LLM to handle a specific task through prompt formatting, language, augmentation, and examples. Prompt engineering is inexorably tied to RAG, as all of this information is contained within a prompt. How you delimit and present your information to a model, and how that model was fine-tuned or trained, can drastically impact how it is able to respond to your prompting. Prompt engineering is currently far more art than science and requires brute-force attempts to convince the model to perform the task you are requesting until you learn its capabilities and what works. The science portion would be reading the papers about how the original model was trained, what data it was trained on, and how this training information was presented to the model. In some cases, prompt engineering will not produce effective responses and you’ll be forced to fine-tune or further fine-tune your model. In other cases, you’ll determine a model is not sufficient for your use-case.
n Shot Prompting: This can take the form of zero shot, one shot, two shot, few shot, and other forms of prompt engineering. Zero shot prompting is prompting an LLM using unique structure and language that was not seen in training but may have been seen in fine-tuning. One shot prompting gives an LLM an example prompt and example response and then another prompt to address. Two shot prompting gives an LLM two example prompts and example responses and then a third prompt to address. Few shot prompting is 3-5 example prompts and example responses followed by an additional prompt to address. Typically using more than 3 examples provides diminishing returns in response quality and providing 5+ examples indicates you should probably further train/fine-tune your model or that your model may not be capable of handling the complexity of your request.
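A few-shot prompt construction sketch; the task, examples, and delimiters are all hypothetical:

```python
# Three worked examples (few-shot) followed by the prompt the model should address.
examples = [
    ("Translate to French: cheese", "fromage"),
    ("Translate to French: bread", "pain"),
    ("Translate to French: apple", "pomme"),
]
query = "Translate to French: water"

prompt = "\n\n".join(f"{q}\nAnswer: {a}" for q, a in examples)
prompt += f"\n\n{query}\nAnswer:"
print(prompt)
```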