Chapter 3: Technical Insights into AI

Ever planned a road trip with your friends? The first time I (Jason) ever planned a road trip on my own, I went to AAA and requested a TripTik. AAA put together a printed booklet for me. The booklet contained map after map showing me how I would get from point A to point B. They were even kind enough to go through with a highlighter and highlight the roads for me. It was a hefty little booklet that made traveling easy. AAA still has something called TripTik, but it’s definitely not like the one I got on my first road trip. Instead, I hop into my car, pull out my smartphone, open their AAA app, and plug in where I’m going, and the smartphone does everything else. The app quickly calculates the best route based on real-time traffic data, road conditions, and even your driving preferences. It’s like having a personal travel guide right in your pocket! Heck, you may not even need a smartphone, so many cars are coming pre-built with navigation systems right in the dashboard.
Have you ever wondered how your navigation app seems to know everything? How does it understand your voice commands, find the most efficient route, and even suggest alternative paths when you encounter unexpected roadblocks? Well, that’s where AI comes into play.
Just like your navigation app uses AI to make your road trip smoother and more enjoyable, AI is revolutionizing countless industries and aspects of our lives. From personalized recommendations on your favorite streaming platforms to advanced medical diagnostics and even self-driving cars, AI is reshaping the way we live, work, and interact with the world around us.
In this chapter, we’ll explore the core concepts, capabilities, and limitations of this transformative technology. We’ll also describe the current state of AI, discussing its strengths and weaknesses and separating fact from fiction. So buckle up and get ready for an exciting journey into the world of AI! By the end of this chapter, you’ll have a solid understanding of what makes AI tick and how it’s transforming our world one algorithm at a time.
Core Concepts in AI
Learning Objectives
- Discuss the fundamental components of AI algorithms and how they work together to solve problems.
- Differentiate between various types of algorithms used in AI.
- Recognize the role of data structures in organizing and managing information in AI systems.
- Compare the strengths and weaknesses of common programming languages used in AI.
Have you ever worked in a kitchen? If not, they are often hot environments with a lot of people and moving pieces. Imagine you’re the head chef working in a bustling kitchen. Every day, you’re tasked with creating culinary masterpieces that delight and satisfy your hungry patrons. But behind the artistry and flair of your dishes lies a critical foundation: your recipes (the ones your grandparents passed down to you) and the organization of your kitchen.
Just as a chef relies on well-crafted recipes and a meticulously arranged kitchen to create delicious meals, AI systems depend on algorithms and data structures to process information and solve complex problems. Algorithms, like recipes, provide step-by-step instructions that guide the AI system through specific tasks, while data structures, much like the arrangement of a kitchen, ensure data is accessible and organized.
But what exactly are algorithms and data structures, and how do they work together to power AI? In this section, we’ll put on our chef’s hat and explore the fundamental concepts behind these essential components of AI systems. We’ll dive into the different types of algorithms, from classic search and sorting techniques to cutting-edge deep learning models, and discover how they solve real-world problems. We’ll also examine common data structures, such as arrays, linked lists, and trees, and understand how they enable efficient data storage and retrieval.
So, let’s sharpen our knives, fire up the stove, and embark on a culinary adventure through the world of algorithms and data structures in AI.
Algorithms

I am not a baker, but one of my favorite shows to watch on Netflix is Nailed It. If you’ve never seen the show, amateur bakers compete for cash as they create artistic confections that are realistically impossible for a novice baker to replicate, given a finite amount of time and little know-how. Why do I love the show? Because, more often than not, the bakes are absolute disasters. The whole point of the show is to watch how things can go from bad to worse. One of the most common mistakes bakers make is not following the provided recipe. Yep, they’re given the recipe and the supplies, but it’s amazing how often they can’t follow the recipe. The bakers will mix the ingredients wrong, use salt instead of sugar, leave out ingredients, or not let the items bake long enough. The examples of what can go wrong when baking happen every episode, which is what makes the show so much fun. Long before there was a show, there was a meme of people having baking and decorating fails. Figure 2.2 shows an example of what I think are supposed to be unicorn cakes.
You’re probably asking yourself, why are they talking about baking in a book on AI? In this section, we’ll discuss algorithms, which are similar to recipes. Let’s explore a basic definition of the term “algorithm.”
Defining Algorithm
An algorithm is a finite set of well-defined, step-by-step instructions that takes one or more inputs, performs a series of operations or computations, and produces one or more outputs to solve a problem or accomplish a specific task.
Finite Set of Instructions
An algorithm consists of a limited number of steps or instructions. Each instruction is clear and unambiguous, leaving no room for interpretation. The instructions are arranged in a specific order to achieve the desired outcome. This finite nature of algorithms ensures that they can be implemented in computer programs and executed in a predictable manner. In the baking world, if ingredients are not put into a cake mix in the right order, things can go horribly wrong. If you mix your dry and wet ingredients too soon, you’ll have a cake that is clumpy.
Well-defined Steps
Each step in an algorithm is precisely defined, ensuring that the process is reproducible. The steps are designed to be followed mechanically, without the need for creativity or guesswork. This well-defined nature of algorithmic steps makes them suitable for implementation in computer programs, where precision and reproducibility are essential. In baking, if you don’t follow the steps precisely, you can end up with a cake that is dense or flat, or the cake could end up with a different taste and texture.
Inputs
Algorithms receive one or more inputs, which are the data or information to be processed. Inputs can be in various forms, such as numbers, text, or even more complex data structures. The algorithm’s instructions operate on these inputs to produce the desired outputs. The ability to accept inputs allows algorithms to be versatile and adaptable to different scenarios. Obviously, the inputs need to be right. If you substitute the wrong ingredient in a cake recipe, the outcome could be disastrous. A friend of mine once used Egg Beaters instead of eggs in a cake. The Egg Beaters didn’t mix with the cake batter, and she ended up with a dense cake-like substance that had baked Egg Beaters throughout the creation. Inputs matter.
Operations or Computations
The core of an algorithm lies in the series of operations or computations performed on the inputs. These operations or computations can range from simple arithmetic calculations to complex data transformations. The operations are designed to manipulate the inputs in a way that leads to the desired solution. The specific operations used in an algorithm depend on the problem being solved or the task being accomplished. In our cake metaphor, this is the actual process of mixing and baking the cake. Just as operations in an algorithm can range from simple to complex, the cake-making process can involve straightforward tasks like mixing flour and sugar or more intricate steps like folding in beaten egg whites or carefully layering and frosting the cake.
Outputs
Algorithms produce one or more outputs, which are the results or solutions to the problem at hand. Outputs can be in the form of numbers, text, data structures, or even actions to be taken. The outputs are the culmination of the algorithmic process and represent the achieved goal. The ability to generate outputs makes algorithms useful for solving problems and making decisions. If everything has gone according to plan, we have a cake.
Problem-solving or Task Accomplishment
Algorithms are designed to solve specific problems or accomplish particular tasks. The problem or task can be mathematical, computational, or related to any other domain. By following the algorithm’s instructions, the problem is solved, or the task is completed efficiently and effectively. The problem-solving nature of algorithms makes them invaluable tools in various fields, from science and engineering to business and finance. In many ways, we probably should have started with this one since the whole purpose of an algorithm is to solve a problem or accomplish a specific task. In baking, the problem could be as simple as deciding what dessert to serve our dinner guests, and the task could be as simple as baking the cake.

To illustrate these components, let’s consider the simple example of adding two numbers, such as 1 + 1 (Figure 2.3).
- The algorithm receives two inputs: the numbers 1 and 1.
- The algorithm has three parts:
  - Receive Inputs: Take the two numbers that are to be added.
  - Perform Addition: Calculate the sum of the two numbers. (1 + 1)
  - Produce Result: The result of the addition is computed. (1 + 1 = 2)
- The output is the sum of the two numbers, which is 2.
- The problem being solved is the addition of two numbers.
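For readers curious about what this looks like in code, the three parts above can be sketched in a few lines of Python. This is only an illustration; the function name `add_two_numbers` is our own invention, not something from a standard library.

```python
def add_two_numbers(first, second):
    """A tiny algorithm: receive two inputs, add them, produce one output."""
    # Step 1: Receive inputs (they arrive as the two function arguments).
    # Step 2: Perform addition.
    total = first + second
    # Step 3: Produce the result.
    return total


result = add_two_numbers(1, 1)
print(result)  # prints 2
```

Notice that every step is finite, well defined, and mechanical: the same inputs always produce the same output.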
Algorithms are used in a wide range of applications, from simple arithmetic to complex machine learning (ML) models (more on these later in this chapter). They provide a systematic and efficient approach to problem-solving, enabling computers and other devices to perform tasks that would otherwise be difficult, time-consuming, or even impossible for humans to accomplish.
Types of Algorithms
Algorithms are the backbone of computer science and AI. They are step-by-step procedures that solve problems, process data, and make decisions. In the context of AI, algorithms enable machines to learn from experience, recognize patterns, and make intelligent choices without being explicitly programmed for every situation.
In this section, we will explore the diverse landscape of algorithms used in AI and computer science. We will discuss the five main categories: problem-solving algorithms, data processing algorithms, optimization algorithms, ML algorithms, and security algorithms (Figure 2.4). Each category encompasses a range of algorithms designed to tackle specific challenges and perform unique functions.

Problem-solving Algorithms
Problem-solving algorithms are designed to tackle specific challenges and find solutions efficiently. These algorithms are applied to a wide range of problems, from searching for information to analyzing complex networks and processing text data.
Search Algorithms
These algorithms help find specific items or information within a large amount of data. They are like strategies for finding a needle in a haystack.
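As a rough sketch, here are two common search strategies in Python. The first checks every item one by one; the second assumes the items are already sorted and uses the standard `bisect` module to home in on the target by repeated halving. The `haystack` list and both function names are made up for illustration.

```python
from bisect import bisect_left


def linear_search(items, target):
    """Check every item in turn; return its position, or -1 if it is absent."""
    for position, item in enumerate(items):
        if item == target:
            return position
    return -1


def binary_search(sorted_items, target):
    """Search a sorted list by repeated halving; return the position or -1."""
    position = bisect_left(sorted_items, target)
    if position < len(sorted_items) and sorted_items[position] == target:
        return position
    return -1


haystack = ["button", "coin", "needle", "pin", "thread"]  # already in A-to-Z order
print(linear_search(haystack, "needle"))  # prints 2
print(binary_search(haystack, "needle"))  # prints 2
```

Both find the needle, but on a very large sorted haystack the halving strategy needs far fewer checks.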
Graph Algorithms
These algorithms deal with problems that involve connected items or paths between them, like finding the fastest route between two cities on a map.
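To make the route-finding idea concrete, here is a minimal sketch of one well-known graph algorithm, Dijkstra's shortest-path algorithm. The four cities and the driving hours between them are invented for this example, and real navigation apps are far more elaborate.

```python
import heapq

# Hypothetical road map: each city lists its neighbors and the driving hours between them.
roads = {
    "A": {"B": 2, "C": 5},
    "B": {"A": 2, "C": 1, "D": 4},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"B": 4, "C": 1},
}


def fastest_route(graph, start, goal):
    """Return (total_hours, path) for the quickest route, or None if there is none."""
    queue = [(0, start, [start])]  # (hours so far, current city, path taken)
    visited = set()
    while queue:
        hours, city, path = heapq.heappop(queue)  # always expand the cheapest option
        if city == goal:
            return hours, path
        if city in visited:
            continue
        visited.add(city)
        for neighbor, cost in graph[city].items():
            if neighbor not in visited:
                heapq.heappush(queue, (hours + cost, neighbor, path + [neighbor]))
    return None


print(fastest_route(roads, "A", "D"))  # prints (4, ['A', 'B', 'C', 'D'])
```

The direct-looking options (A to C to D, or A to B to D) each take six hours, so the algorithm correctly prefers the four-hour route through both B and C.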
String Processing Algorithms
These algorithms help analyze, compare, and manipulate text data, such as finding specific words or patterns within a large document.
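Python's built-in string tools and its standard `re` (regular expression) module give a quick taste of string processing. The sample sentence below is made up for illustration.

```python
import re

document = "The cake recipe calls for sugar. Do not swap the sugar for salt."

# Count how many times a specific word appears.
print(document.count("sugar"))  # prints 2

# Find a pattern: every word that starts with a lowercase "s".
s_words = re.findall(r"\bs\w*", document)
print(s_words)  # prints ['sugar', 'swap', 'sugar', 'salt']
```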
Data Processing Algorithms
Data processing algorithms organize, compress, and transform data efficiently. They play a crucial role in managing and analyzing large datasets, making data more accessible and manageable.
Sorting Algorithms
These algorithms organize data in a predetermined sequence, like organizing books in a library from A to Z.
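Here is a small sketch of one classic sorting technique, insertion sort, which works much like a librarian slotting each new book into its proper place on a shelf. The book titles are just sample data, and in everyday Python you would simply call the built-in `sorted` function, shown at the end for comparison.

```python
def insertion_sort(titles):
    """Build a sorted shelf by placing each title where it belongs."""
    shelf = []
    for title in titles:
        position = 0
        # Walk along the shelf until we find a title that should come later.
        while position < len(shelf) and shelf[position] <= title:
            position += 1
        shelf.insert(position, title)
    return shelf


books = ["Moby-Dick", "Beloved", "Walden", "Emma"]  # hypothetical, unsorted shelf
print(insertion_sort(books))  # prints ['Beloved', 'Emma', 'Moby-Dick', 'Walden']
print(sorted(books) == insertion_sort(books))  # prints True
```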
Data Compression Algorithms
These algorithms help reduce the size of data, similar to how you might pack your clothes tightly in a suitcase to save space.
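As a quick illustration, Python's standard `zlib` module can pack repetitive text into far fewer bytes and then unpack it again without losing anything. The sample text is invented, and the exact compressed size will vary, so we only print it rather than promise a number.

```python
import zlib

# Hypothetical, highly repetitive data (repetition is what makes it easy to pack).
text = ("pack the suitcase tightly " * 50).encode("utf-8")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(len(text), "bytes before,", len(compressed), "bytes after")
print(restored == text)  # prints True
```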
Optimization Algorithms
Optimization algorithms are designed to find the best solution among many possible options. These algorithms are used in various fields, from engineering to finance, to make informed decisions and maximize efficiency.
Optimization Algorithms
These algorithms are used to sort through a large number of possibilities and select the best one, like choosing the most efficient route for a delivery truck to save time and fuel.
Dynamic Programming Algorithms
These algorithms solve complex problems by breaking them down into smaller, easier-to-solve subproblems, similar to how you might plan a multi-city trip by deciding the best order to visit each city.
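A textbook illustration of dynamic programming is making change with the fewest coins. This is our own example rather than the trip-planning scenario above, but the idea is the same: solve the small subproblems first (the fewest coins for 1 cent, 2 cents, and so on) and reuse those answers to solve the bigger one.

```python
def fewest_coins(coin_values, amount):
    """Return the fewest coins needed to make `amount`, or None if it is impossible."""
    # best[a] stores the answer to the smaller subproblem "fewest coins to make a".
    best = [0] + [None] * amount
    for a in range(1, amount + 1):
        options = [
            best[a - coin] + 1
            for coin in coin_values
            if coin <= a and best[a - coin] is not None
        ]
        best[a] = min(options) if options else None
    return best[amount]


print(fewest_coins([1, 5, 10, 25], 30))  # prints 2
print(fewest_coins([4, 5], 7))           # prints None
```

Thirty cents needs only two coins (25 + 5), while 7 cannot be made at all from 4s and 5s.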
Heuristic Algorithms
These algorithms use practical methods or educated guesses to solve problems quickly, even if the solution might not be perfect. It’s like using a rule of thumb to make a decision when you don’t have all the information.
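One simple rule of thumb for planning a delivery route is "always drive to the closest stop you haven't visited yet," often called the nearest-neighbor heuristic. The sketch below uses made-up stop names and map coordinates. It is fast, but it does not guarantee the shortest possible route.

```python
import math

# Hypothetical delivery stops as (x, y) map coordinates.
stops = {"Depot": (0, 0), "Bakery": (2, 1), "School": (5, 0), "Library": (1, 3)}


def nearest_neighbor_route(locations, start):
    """Visit every stop by always going to the closest unvisited one."""
    route = [start]
    remaining = set(locations) - {start}
    while remaining:
        here = locations[route[-1]]
        closest = min(remaining, key=lambda name: math.dist(here, locations[name]))
        route.append(closest)
        remaining.remove(closest)
    return route


print(nearest_neighbor_route(stops, "Depot"))
# prints ['Depot', 'Bakery', 'Library', 'School']
```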
ML Algorithms
ML algorithms enable computers to learn from experience and improve their performance without being explicitly programmed. These algorithms are divided into three main categories based on the type of learning involved. We’ll break these algorithms down further later in this chapter.
Security Algorithms
Security algorithms protect sensitive information and ensure secure communication in the digital world. These algorithms form the backbone of modern data protection through various specialized approaches:
Cryptography Algorithms
These algorithms focus on data encryption and decryption, transforming sensitive information into coded formats that can only be understood by authorized parties with the correct decryption keys. Think of these as machines that scramble your messages into secret codes. Only people with the right decoder ring (the “key”) can understand them. This is how your online banking passwords and private emails stay hidden from prying eyes. It’s like having a secret language only you and your friends understand.
Hash Functions
These algorithms transform input data into fixed-length codes that serve as digital fingerprints for authentication and data integrity verification. Any change to the original data produces a different hash value, making unauthorized modifications immediately detectable. Think of them as super-fast checkers that create a stamp for every file or message. If anyone changes even a single letter or number, the stamp changes completely. This ensures that the files you download and the messages you receive haven’t been tampered with. Imagine a sticker that tears if a package is opened, but for digital files.
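You can see the fingerprint idea in action with Python's standard `hashlib` module and the widely used SHA-256 hash function. The two messages below are invented and differ by a single character.

```python
import hashlib

original = "Pay Alice $10".encode("utf-8")
altered = "Pay Alice $90".encode("utf-8")  # one character changed

original_stamp = hashlib.sha256(original).hexdigest()
altered_stamp = hashlib.sha256(altered).hexdigest()

print(len(original_stamp))              # prints 64 (the stamp is always the same length)
print(original_stamp == altered_stamp)  # prints False (a tiny change gives a different stamp)
print(original_stamp == hashlib.sha256(original).hexdigest())  # prints True
```

The last line shows the other half of the story: hashing the same data again always gives the same stamp, which is what lets you verify a download.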
Digital Signatures
These algorithms combine aspects of both cryptography and hashing to provide authentication, non-repudiation, and integrity verification. They create a unique identifier tied to both the document and the signer, similar to a handwritten signature but with cryptographic security. These are like super-secure electronic signatures that prove a document is authentic and hasn’t been altered. They’re like a combination of a secret code and a digital stamp, making them virtually impossible to fake. This is how we know that critical online documents are genuine.
Algorithms in AI
Algorithms play a vital role in helping AI systems make decisions, solve problems, and accomplish tasks. At their core, algorithms are the recipes that guide AI systems through the process of analyzing data, identifying patterns, and generating insights. By following these step-by-step instructions, AI systems can tackle complex challenges and make informed decisions without human intervention.
Imagine you’re planning a road trip from New York to Los Angeles. You input your starting point and destination into a navigation app, and within seconds, the app provides you with the best route to take, considering factors like distance, tolls, traffic, and road conditions. Algorithms enable the app to quickly process a large amount of data and offer the best solution.
Similarly, AI systems rely on algorithms to process and make sense of the data they receive. For example, a facial recognition algorithm in a smartphone learns to identify faces by analyzing thousands of images and identifying common patterns. When you use the face unlock feature, the algorithm compares the image of your face to its learned patterns, decides whether there’s a match, and unlocks your phone if it recognizes you.
Algorithms also help AI systems break problems down into smaller, more manageable parts. This approach is called “divide and conquer.” Just as you might tackle a big project by dividing it into smaller tasks and working on them one by one, AI algorithms solve problems by breaking them down into smaller parts and solving each one individually. For example, when an AI system plays chess, it uses algorithms to analyze the current state of the board, consider possible moves, and predict the outcomes of both its own moves and its opponent’s likely moves. By evaluating these subproblems and choosing the most promising moves, the AI can develop a winning strategy and make decisions that lead to victory.
Algorithms also let AI systems learn from experience and improve their performance over time. AI algorithms identify data patterns and adjust their own parameters to better solve problems in the future. This process is similar to how humans learn from our mistakes and experiences to make better decisions in the future.
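To show what "adjusting parameters" can mean, here is a deliberately tiny sketch of learning by trial and correction. The model has a single adjustable number, `weight`, and it is shown a few made-up examples where the right answer is always twice the input. Each time its guess is off, it nudges the weight to shrink the error. Real ML systems use the same basic idea with millions or billions of parameters; the training data and learning rate here are assumptions chosen only for illustration.

```python
# Hypothetical training data: each target is twice its input.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

weight = 0.0          # the model's single parameter, starting from a poor guess
learning_rate = 0.05  # how big each correction is

for _ in range(200):                 # repeat the lesson many times
    for x, target in examples:
        prediction = weight * x      # the model's current guess
        error = prediction - target  # how far off the guess was
        weight -= learning_rate * error * x  # nudge the weight to reduce the error

print(round(weight, 3))  # prints 2.0
```

Nobody told the program that the rule was "multiply by two." It found that value by learning from its mistakes.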
Data Structures

I love libraries. When I was a first-year student in college, I remember having to flip through the card catalog to find anything in the library. Often, when I found an entry that looked promising, I would go sit in the stacks and look at the book that caught my attention and all the other books that were around that one. Thankfully, libraries have been using the Dewey Decimal System since Melvil Dewey published his ideas in 1876. Card catalogs, just like the libraries themselves, are highly structured in how they store information.
Data structures are specialized formats for organizing, managing, and storing data in a computer so that it can be accessed and modified efficiently. They provide a way to arrange data to suit specific purposes, facilitating various operations such as data retrieval, insertion, deletion, and traversal. Data structures are like the organization systems used in this library to arrange the books. For example, an array is like a simple bookshelf where books are placed side by side in a specific order. A linked list is akin to a chain of books where each book has a reference to the next one. Trees are like the hierarchy of sections and subsections in the library, with each branch leading to more specific categories. Graphs represent interconnected networks of books on related topics, much like a web of knowledge. Just as an organized library helps you find and retrieve books quickly, efficient data structures enable computer systems to store, access, and manipulate data swiftly and effectively. In this section, we’re going to explore the importance of data structures in AI.
What are Data Structures?
Data is like the building blocks of information, much like individual items in your wardrobe. Just as clothing pieces are combined to create various outfits, data is used to form information that computers and humans can understand and use. This data can be anything from numbers and text to images and sounds.
Imagine you need to store and organize these clothing items in a way that makes them easy to find and use. This is where “data structures” come into play. Data structures are like the organizers in your closet. They help us sort and keep track of our data, just as shelves, drawers, and hangers help keep our clothes orderly.
When you think about arranging your closet, you might decide to hang dresses on hangers, fold sweaters and place them in a stack on a shelf, and place shoes on a shoe rack on the ground. Each method of organization suits the type of clothing and how often you use it. If it’s the summer, we may store our winter clothes or hang our giant puffy jackets in the back of the closet because we don’t need to wear them for a while. Similarly, data structures provide a structured approach to handling information. Some common types of data structures include:
- Arrays are like rows of hooks for hanging your everyday items. They keep data in a linear order, making it easy to reach what you frequently use by counting along the row.
- Linked Lists are more like a charm bracelet, where each charm hangs off the last. This setup allows you to easily add or remove elements without rearranging the whole bracelet.
- Trees structure data like a family tree or a tree of decision branches, helping you follow paths of decisions or relationships to reach an answer.
- Hash Tables work like a cubbyhole system where items are stored based on a label or a key, allowing for quick retrieval based on that label.
By organizing data in these different structures, it becomes faster and more efficient to search for specific information, add new data, or remove data that are no longer needed. This efficient organization is key to managing large quantities of data, making it quicker to find and use exactly what you need, just as organizing your closet makes it easier to grab that sweater you love on a chilly morning.
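The four structures listed above can each be sketched in a few lines of Python. All of the names and sample items below (`hooks`, `Charm`, `family`, `cubbies`) are invented for illustration. Python's built-in list plays the role of an array, and its built-in dictionary is a hash table.

```python
from dataclasses import dataclass
from typing import Optional

# Array (a Python list): items sit in a row and are reached by counting along it.
hooks = ["coat", "scarf", "hat"]
print(hooks[1])  # prints scarf (counting starts at 0)


# Linked list: each charm holds a reference to the next charm.
@dataclass
class Charm:
    name: str
    next: Optional["Charm"] = None


bracelet = Charm("star", Charm("heart", Charm("moon")))
# Add a new charm after the first one without rearranging the others.
bracelet.next = Charm("sun", bracelet.next)


def charm_names(first):
    """Follow the chain from charm to charm, collecting names."""
    names = []
    node = first
    while node is not None:
        names.append(node.name)
        node = node.next
    return names


print(charm_names(bracelet))  # prints ['star', 'sun', 'heart', 'moon']

# Tree: each entry branches into more specific entries (here, nested dictionaries).
family = {"Grandparent": {"Parent": {"Child": {}}, "Aunt": {}}}
print(list(family["Grandparent"]))  # prints ['Parent', 'Aunt']

# Hash table (a Python dict): items are stored and fetched by a label, or key.
cubbies = {"A12": "sneakers", "B07": "umbrella"}
print(cubbies["B07"])  # prints umbrella
```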
Importance of Data Structures in AI

Data structures make AI work effectively. Think of AI as a master chef who needs to prepare a grand feast, dealing with a vast assortment of ingredients, each needing to be stored, retrieved, and combined in precise ways to create delicious dishes. Without a well-organized kitchen—where spices are neatly labeled, tools are within reach, and ingredients are stored for easy access—the cooking process would be chaotic, prone to mistakes, and slow.
Because we love cakes in this chapter, imagine you’re a pastry chef working on a multi-tiered wedding cake. Each layer might require different ingredients and treatments. An organized chef will have their pantry neatly organized so they can find all the equipment and ingredients they’ll need when they need them. If it’s time to use food coloring, you don’t want to spend an hour searching for the little bottles. In AI, similar to how a chef uses different shelves or sections in a pantry to organize ingredients for easy access, we use different data structures to manage data. For instance, you could use arrays, much like shelves in a pantry, to store the pixel values of images in an orderly manner. These arrays help you quickly grab the ‘ingredients’ you need without searching through every corner of the kitchen.
Similarly, when dealing with the relationships between different flavors or elements of a cake, a chef might visualize this as a tree: a structure with roots, branches, and leaves, each representing how one flavor influences another. In a German chocolate cake, it’s the combination of the chocolate cake flavor and the coconut-pecan frosting that brings the cake to life. In an AI project, like image recognition, using a tree structure allows you to manage and explore the connections between various features of an image efficiently, much like how a chef uses a recipe tree to keep track of which flavors come from which ingredients.
By organizing data in these structured ways, an AI system, like a skilled chef, can swiftly locate the necessary ‘ingredients’ (data), combine them in the right ways, and ‘bake’ up solutions (predictions or results) that are both speedy and accurate, greatly enhancing the model’s performance and output. This approach not only saves time but also improves the quality of the dishes (or AI outputs) produced.
Relationship Between Data Structures and Algorithms

Have you ever been to IKEA? If you have, then you’ll know that everything comes in flat boxes that you have to assemble yourself. Figure 2.6 shows just one row in an IKEA warehouse – shelves filled with flat-packed furniture waiting to be purchased, unpacked, and built.
In computer science, data structures and algorithms are closely intertwined. To understand their relationship, let’s use the IKEA furniture analogy:
Formula 1: Data + Rules for organization = Data Structures
In the IKEA analogy, this translates to: Furniture parts + Organization system = Organized kit
The furniture parts (screws, panels, brackets) represent the raw data. The organization system (labeled bags, numbered components) represents the rules for organizing this data. Together, they form an organized kit—similar to how data structures provide organized ways to store and access data.
Formula 2: Data Structures + Algorithms = Programs
Using the IKEA metaphor, this can be represented as: Organized kit + Assembly instructions = Finished furniture
The organized kit (data structures) combined with assembly instructions (algorithms) leads to the finished furniture. In the same way, when you apply algorithms to data structures in computer science, you create functioning programs.
Just as you need both well-organized parts and clear step-by-step assembly instructions to build IKEA furniture successfully, you need both efficient data structures and effective algorithms to build a working computer program. The data structure determines how your information is organized and stored, while algorithms provide the logical steps to process that information and achieve your desired outcome.
This relationship is fundamental to computer science: choosing the right data structure makes certain algorithms more efficient, just as a well-organized furniture kit makes assembly much easier and faster.
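Here is a small experiment you can run to see this effect. We store the same made-up part numbers two ways, in a plain list and in a hash-based set, and time the same question ("is this part in the kit?") against each using Python's standard `timeit` module. Exact timings depend on your computer, so none are promised here, but the set is typically dramatically faster because it can jump straight to the answer instead of checking every item.

```python
import timeit

part_numbers = list(range(100_000))  # hypothetical part numbers in a plain list
part_index = set(part_numbers)       # the same numbers organized as a set

target = 99_999  # the worst case for the list: the very last item

# Ask the same question 100 times against each structure.
list_seconds = timeit.timeit(lambda: target in part_numbers, number=100)
set_seconds = timeit.timeit(lambda: target in part_index, number=100)

print(f"list: {list_seconds:.5f} seconds, set: {set_seconds:.5f} seconds")
```

The algorithm's goal is identical in both cases. Only the organization of the data changed.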
Real-World Case Studies
Let’s look at two real-world case studies that demonstrate how data structures and algorithms are used in AI applications, keeping in mind that our readers may have little to no background in computer science.
Case Study 1: Recommender Systems

Imagine you’re browsing an online shopping platform like Amazon and notice that the website suggests products you might be interested in. These suggestions are often spot-on, making you wonder how the platform seems to know your preferences so well. This is the work of a recommender system, a type of AI application that uses data structures and algorithms to provide personalized recommendations.
Here, the raw data (the furniture parts in our IKEA analogy) consists of information about users, products, and their interactions, such as ratings, purchases, or views. This data is organized into data structures like matrices (the organized kit), where each row represents a user, each column represents a product, and the values in the matrix represent the user’s interaction with the product.
The recommender system then uses algorithms, like collaborative filtering (the assembly instructions), to analyze the data and find patterns. It might look for users who have similar interaction patterns to you and recommend products that those similar users have liked but that you haven’t tried yet (the finished furniture). This process involves complex mathematical operations, but the key idea is that the algorithm uses the organized data to make personalized suggestions.
By efficiently organizing and processing large amounts of data using data structures and algorithms, recommender systems can provide tailored recommendations to millions of users, enhancing their shopping experience and helping them discover new products and services they might purchase.
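A toy version of this idea fits in a few lines. The sketch below is a simplified form of user-based collaborative filtering, not how any particular retailer actually does it: it stores made-up ratings in a small user-by-product table, finds the user whose tastes are most similar to yours (using a standard measure called cosine similarity), and suggests what that user rated highly that you haven't tried. The users, products, ratings, and the "4 or higher" cutoff are all assumptions for illustration.

```python
import math

products = ["tent", "lantern", "hiking boots", "camp stove"]

# Hypothetical ratings from 1 to 5 (0 means no interaction yet).
# Each row is a user; each position lines up with a product above.
ratings = {
    "you":   [5, 4, 0, 0],
    "alex":  [5, 5, 4, 0],
    "blake": [1, 0, 2, 5],
}


def cosine_similarity(a, b):
    """Score how alike two rating rows are (closer to 1.0 means more alike)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(user, all_ratings, product_names):
    """Suggest products the most similar user liked that `user` has not tried."""
    mine = all_ratings[user]
    others = {name: row for name, row in all_ratings.items() if name != user}
    most_similar = max(others, key=lambda name: cosine_similarity(mine, others[name]))
    theirs = others[most_similar]
    return [
        product_names[i]
        for i, rating in enumerate(theirs)
        if rating >= 4 and mine[i] == 0
    ]


print(recommend("you", ratings, products))  # prints ['hiking boots']
```

Because your ratings look much more like Alex's than Blake's, the system suggests the hiking boots Alex liked.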
Case Study 2: Fraud Detection

Have you ever received a text alert from your bank saying that something seems a bit fishy? Well, banks want to protect their customers from fraudulent transactions. With millions of transactions happening every day, manually checking each one would be impossible. This is where AI-powered fraud detection systems come into play, using data structures and algorithms to identify suspicious activities.
The raw data, in this case, includes information about transactions, such as the amount, location, time, and parties involved (the furniture parts). This data is organized into data structures like trees or graphs (the organized kit), where each node represents a transaction, and the edges between nodes represent the relationships between transactions.
The fraud detection system then uses algorithms, like decision trees or graph traversal algorithms (the assembly instructions), to analyze the structured data and identify patterns that might indicate fraudulent behavior. For example, if a single account suddenly has a burst of transactions from different locations within a short time, the algorithm might flag this as suspicious activity.
By organizing the transaction data into efficient data structures and applying sophisticated algorithms, the fraud detection system can quickly sift through millions of transactions, spotting potential fraud in real time (finished furniture). This helps the bank prevent financial losses and protects its customers from the negative impacts of fraudulent activities.
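The "burst of transactions from different locations" pattern can be sketched as a simple hand-written rule. To be clear, this is a simplified stand-in rather than a decision tree or graph algorithm, and it is not how any real bank works. The transactions, the 60-minute window, and the limit of two cities are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical transactions: (account, minute of the day, city).
transactions = [
    ("acct-1", 600, "Boston"),
    ("acct-1", 615, "Miami"),
    ("acct-1", 630, "Seattle"),
    ("acct-2", 600, "Denver"),
    ("acct-2", 900, "Denver"),
]


def flag_suspicious(records, window_minutes=60, max_locations=2):
    """Flag accounts used in more than `max_locations` cities within one time window."""
    # Organize the raw records by account (the data structure step).
    by_account = defaultdict(list)
    for account, minute, city in records:
        by_account[account].append((minute, city))

    # Scan each account's activity for a suspicious burst (the algorithm step).
    flagged = set()
    for account, events in by_account.items():
        events.sort()
        for start_minute, _ in events:
            cities = {
                city
                for minute, city in events
                if start_minute <= minute < start_minute + window_minutes
            }
            if len(cities) > max_locations:
                flagged.add(account)
                break
    return sorted(flagged)


print(flag_suspicious(transactions))  # prints ['acct-1']
```

The first account is used in three cities within half an hour, which is physically implausible, so it gets flagged. The second account stays in Denver and passes.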
Programming Languages for AI
Most of us will not be computer programmers. For that reason, we will not spend a lot of time on computer programming in this text. Obviously, AI is built on computer programming and on the software engineers and computer programmers who specialize in it. However, we think it’s important to have a basic working knowledge of the major languages so you’ll know what computer scientists are talking about. On a basic level, programming languages are like the languages we speak: tools that help us communicate. In the world of AI, these languages let developers give instructions to computers to create intelligent systems that can perform tasks like recognizing speech or predicting trends. Again, our goal over the next few paragraphs is not to create computer programmers but to introduce you to some of the major programming languages and how they aid in AI development. Remember, this is a book on AI literacy, not on becoming a computer programmer. If you want to delve more into computer programming, we recommend taking a college course or one of the online courses readily available for each of these languages. However, we also acknowledge that many AI industry insiders are questioning the necessity of learning to program since so much of this can be handled now by AI itself.[1] [2]
Overview of Common Programming Languages in AI Development
Among the many programming languages, Python, R, Java, and C++ are particularly popular for AI tasks.
- Python is praised for its simplicity and a vast array of libraries, making it a go-to for tasks like data analysis and learning algorithms. Python is user-friendly and has a huge support community, making it a great starting point for beginners. However, it might not handle very heavy data tasks as efficiently as some other languages.
- R specializes in statistics and making data understandable through visuals, which is why it’s a favorite among statisticians. R is excellent for statistical calculations and graphs but might become sluggish with very large datasets.
- Java and C++ are like the heavy lifters in AI; they’re powerful and efficient, perfect for handling complex AI systems that need to run smoothly and quickly. Java and C++ are fast and powerful, but they require a deeper understanding to use effectively, making them more challenging for newcomers.
Libraries and Frameworks for AI Development
Just as a carpenter uses tools to build furniture, programmers use libraries and frameworks to build AI systems. These tools, like TensorFlow, PyTorch, and Keras (see sidebar), offer ready-to-use building blocks for creating complex AI models. This means developers can focus more on solving the problem at hand than getting bogged down by the detailed mathematics of AI.
Choosing the Right Programming Language

Choosing the right language and tools for an AI project is like selecting the right material and tools for a building project; it determines your level of success. Factors to consider include:
- Project needs: What specific tasks does the project require? Heavy data analysis might lean toward R, while a need for speed and efficiency might point toward C.
- Team skills: What languages are your team members proficient in?
- Scalability: Will the system need to handle growing amounts of data or users smoothly?
How Programming Languages Interact with AI Hardware
Programming languages communicate with AI hardware through tools such as compilers and interpreters, which translate human-friendly code into machine-friendly instructions. This translation is crucial because it ensures that the hardware can understand and execute the commands efficiently, much like how a translator helps two people who speak different languages communicate effectively.
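To make the idea of translation a little more concrete, here is a minimal sketch using Python’s built-in `dis` module. Python translates source code into lower-level "bytecode" instructions for its interpreter rather than directly into hardware instructions (languages like C are compiled all the way down to machine code), so treat this as an illustration of the general idea, not of how any particular AI chip is programmed. The function `add` is a made-up example.

```python
import dis


def add(a, b):
    # One line of human-friendly Python.
    return a + b


# dis.Bytecode lists the lower-level instructions that the Python
# interpreter actually runs for this function.
instruction_names = [instr.opname for instr in dis.Bytecode(add)]
print(instruction_names)
```

The exact instruction names printed depend on your Python version, but the point is the same: one readable line becomes several small steps that the machine can follow.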
Exploration of the Development Communities and Ecosystems Around These Languages
The thriving communities around these programming languages are a huge part of their success. These communities are akin to vibrant city neighborhoods where people exchange ideas, tools, and resources. They offer everything from tutorials and advice to free code libraries, making it easier for everyone to build AI systems and continue improving their skills in AI development.
By understanding and leveraging these languages and tools, developers can more effectively guide AI systems to process information and solve problems, much like a director leads actors in a play to ensure a successful performance.
Sidebar – Overview of Key ML Frameworks
TensorFlow, PyTorch, and Keras are frameworks that developers use to build and train ML models, which are a core component of AI systems. These frameworks provide tools to make it easier to construct, test, and deploy AI models without needing to understand complex mathematics and programming from scratch. Here’s a brief overview of each:
TensorFlow
Developed by Google, TensorFlow is one of the most widely used ML libraries. It handles a wide range of tasks, but it’s particularly strong in the field of deep learning. TensorFlow is known for its extensive array of tools, libraries, and community resources, which simplify the construction and deployment of AI models. It runs on both central processing units (CPUs) and GPUs, and GPU support can make training large models considerably faster.
PyTorch
PyTorch was developed by Meta’s AI Research lab and is favored for its simplicity and ease of use, especially when it comes to prototypes and experiments with deep learning models. It’s particularly popular in the academic and research community because of its dynamic computation graph and efficient memory usage, which allow for adjustments and changes on the fly. PyTorch also supports GPU acceleration, which significantly speeds up the computing processes involved in training and running models.
Keras
Keras is a high-level API for neural networks written in Python. Early versions could run on top of TensorFlow, Microsoft Cognitive Toolkit, or Theano; more recent versions (Keras 3) support TensorFlow, JAX, and PyTorch as backends. It was designed to let users experiment quickly. Keras is known for its user-friendliness and modularity, making it a great choice for beginners who are learning how to develop neural networks. It abstracts away many of the complexities of building a neural network, making it more accessible and easier to handle.
All three frameworks are instrumental in advancing the use and development of AI technologies by simplifying the process of ML model development, enabling developers to focus more on the problem at hand rather than the underlying algorithmic complexity.
Key Takeaways
- AI algorithms are step-by-step instructions that take inputs, perform operations, and produce outputs to solve problems or accomplish specific tasks.
- Different types of algorithms, such as search algorithms, optimization algorithms, and ML algorithms, are used in AI to tackle various challenges and enable intelligent decision-making.
- Data structures, such as arrays, linked lists, trees, and graphs, play a crucial role in AI by organizing and storing data efficiently, allowing for quick access and manipulation.
- Programming languages like Python, R, Java, and C provide the tools and frameworks necessary for implementing AI algorithms and building intelligent systems, each with its own strengths and weaknesses.
Exercises
- Choose a simple AI algorithm and create a flowchart or diagram that visualizes its step-by-step process. Use clear labels and symbols to represent the algorithm’s components and decision points, and write a brief explanation of how it works and its applications in AI systems.
- Select three AI problem scenarios from the provided list and analyze the efficiency of using different data structures for each scenario. Consider factors such as time complexity, space complexity, and ease of implementation, and create a table or chart comparing the data structures’ performance for each scenario.
- Choose a programming language commonly used in AI development and research its strengths, weaknesses, and popular libraries or frameworks. Create a beginner-friendly tutorial that introduces the language and its AI-related features, including code examples, explanations, and practical exercises.
Machine Learning (ML)
Learning Objectives
- Describe machine learning by outlining its purpose and key components.
- Describe the differences between the types of machine learning.
- Discuss how algorithms are used to detect patterns and make decisions based on data.
- Evaluate the benefits and limitations of using machine learning in various applications.

The term “machine learning” was coined by A. L. Samuel in 1959.[3] It’s all about teaching computers to learn and make decisions on their own, without being explicitly programmed. Imagine giving a computer the ability to learn from experience, just like we do as humans. ML has received a lot of attention in recent years, but it’s not new.
Defining ML
So just what is ML? Machine learning (ML) is a subfield of AI that enables computers to use data to improve their performance without explicit programming. It uses algorithms to detect patterns and make descriptive (explaining past events), predictive (forecasting future outcomes), or prescriptive (recommending actions) decisions with minimal human intervention.
Let’s break down this definition into its core components and explore each one in more detail. We’ll use some examples and metaphors along the way to help clarify the concepts for those who might be new to the world of AI and ML.
Subfield of AI
ML is a specific area within the broader field of AI. AI focuses on creating intelligent machines that can perform tasks typically requiring human-like intelligence, such as recognizing speech, translating languages, or making decisions. ML is one of the key approaches used to achieve AI.
Improving Computer Performance
The core idea behind ML is that computers can learn and improve their performance by leveraging data, without being explicitly programmed for each specific task. Just like humans learn from experience and examples, ML algorithms can learn from vast amounts of data to identify patterns and relationships that might be too complex for humans to spot.
Imagine teaching a child to recognize different animals. Instead of providing a set of specific instructions like “if it has a trunk, it’s an elephant; if it has a long neck, it’s a giraffe,” you would show the child many pictures of animals and let them learn from those examples. ML works similarly; by feeding the algorithms lots of data, they can learn to recognize patterns and make decisions on their own.
Using Algorithms to Detect Patterns
ML relies on algorithms. These algorithms can identify patterns, relationships, and insights hidden in the data, and use that knowledge to make predictions or decisions about new, unseen data.
Describe, Predict, and Prescribe
ML systems can be classified based on the types of insights they provide and the actions they enable. These systems typically fall into one of three categories: descriptive, predictive, and prescriptive.
Descriptive
These systems use data to explain or summarize what has happened in the past. They help understand and interpret historical data. For example, a descriptive ML model could analyze customer purchase history to identify which products are frequently bought together.
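As a rough illustration of the "frequently bought together" example, the sketch below simply counts how often pairs of products appear in the same basket. The purchase data is invented, and plain counting stands in here for what a real descriptive model would do at a much larger scale.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history: each inner list is one customer's basket.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["coffee", "milk"],
    ["bread", "jam"],
    ["coffee", "milk", "bread"],
]

pair_counts = Counter()
for basket in baskets:
    # Sort the items so ("bread", "butter") and ("butter", "bread")
    # are counted as the same pair.
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

# Summarize the past: which pairs show up together most often?
top_pairs = pair_counts.most_common(3)
print(top_pairs)
```

Nothing here predicts the future or recommends an action; the code only summarizes what has already happened, which is the defining trait of a descriptive system.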
Predictive
These systems use data to forecast or predict future outcomes. They learn from past data to make informed guesses about what might happen in the future. For example, a predictive ML model could use historical weather data to predict the temperature in a city for the next week.
Prescriptive
These systems use data to recommend actions or decisions. They not only predict what might happen but also suggest the best course of action based on the predictions. A prescriptive ML model could analyze a company’s financial data and market trends to recommend whether it should invest in a particular stock.
Minimal Human Intervention
One of ML’s main advantages is its ability to automate various processes and tasks that would otherwise require a large amount of human labor. Once the algorithms are trained on a large enough dataset, they can work independently to make decisions and predictions without constant human oversight.
ML is like a self-driving car. Just as a human driver needs to learn the rules of the road and gain experience behind the wheel, a self-driving car must be trained on a vast dataset of driving scenarios and examples. However, once trained, the car can navigate on its own, making real-time decisions and adjustments based on the patterns and insights learned during training. The human designer doesn’t need to control every move the car makes manually.
Supervised Learning
Supervised learning is a type of ML where the algorithm learns from labeled data. In this approach, the algorithm is trained on a dataset that comprises input-output pairs, where each input is associated with a corresponding correct output or label. The goal of supervised learning is to learn a function that maps the input data to the desired output, enabling the algorithm to make accurate predictions or decisions on new, unseen data.
Imagine you’re teaching a child to recognize different fruits. You show the child various pictures of apples, bananas, and oranges, and tell them the name of each fruit. This is essentially supervised learning—you provide the child with labeled examples (input-output pairs) to learn from. After seeing enough examples, the child should be able to recognize and correctly classify new fruit images they haven’t seen before.
Supervised learning algorithms can be broadly categorized into two main types: classification and regression. Before we continue, we want to note that this section lists several commonly used mathematical models and statistical techniques. For our purposes, it’s enough to know that these terms exist and that each names a supervised learning method; explaining what they are and how they function is material for advanced courses in AI mathematics.
Classification

If you’ve ever taught someone to distinguish between shapes, you’ve experienced a form of classification learning. Let’s think through how one does this. For our example, let’s imagine that our goal is just to get someone to do two things: determine if a shape IS a triangle and whether a shape IS NOT a triangle. Simple, huh? Not quite. You’d start by gathering a bunch of flashcards with various shapes. You’d show your learner card after card, asking each time, “Is this a triangle?” Your learner would guess, and you’d tell them if they’re right or wrong, explaining the characteristics of triangles when they make a mistake. Through this process of guessing and correction, your learner will gradually improve their ability to recognize triangles.
In 1958, psychologist Frank Rosenblatt decided he wanted to train a computer to do just that—differentiate between triangles and non-triangles.[4] Rosenblatt’s experiment with computers followed a similar pattern, but instead of flashcards, he used a dataset of images. His computer, like your hypothetical learner, started with no knowledge of triangles. Rosenblatt would show the computer an image and ask it to guess whether it was a triangle. Initially, the computer’s guesses were random, but Rosenblatt would inform it whether each guess was correct or incorrect. When wrong, the computer would adjust its criteria for judging triangles, much like a human learner might refine their understanding after incorrectly identifying a triangle on a flashcard.
In Rosenblatt’s experiment, the process of showing images, guessing, and adjusting was repeated many times. With each iteration, the computer fine-tuned its ability to recognize triangles, improving its accuracy over time. Finally, to test whether the computer had truly learned, Rosenblatt presented it with new, unseen images to classify.
This approach to “supervised learning” is a form of “classification” since the task involved categorizing images into two groups: triangles and non-triangles. The remarkable aspect of this experiment was that the computer wasn’t given explicit rules about what makes a triangle. Instead, it learned to recognize triangles through trial, error, and adjustment—mimicking how humans often learn.
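The guess-and-adjust loop described above can be sketched in a few lines of Python. This is a simplified illustration of a perceptron-style learning rule, not a reconstruction of Rosenblatt’s actual experiment: the two numbers describing each "image" are hypothetical measurements we made up, and the learning rate and number of passes are arbitrary choices.

```python
def predict(weights, bias, features):
    # Weighted sum followed by a hard threshold:
    # 1 means "triangle", 0 means "not a triangle".
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train_perceptron(examples, learning_rate=0.1, max_epochs=200):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in examples:
            error = label - predict(weights, bias, features)  # 0 when the guess is right
            if error != 0:
                mistakes += 1
                # Wrong guess: nudge the weights toward the correct answer.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
        if mistakes == 0:  # every training example is now classified correctly
            break
    return weights, bias


# Hypothetical labeled examples: (two made-up measurements, label).
examples = [
    ((2.0, 1.0), 1), ((3.0, 1.5), 1), ((2.5, 2.0), 1), ((3.5, 0.5), 1),
    ((0.5, 2.5), 0), ((1.0, 3.0), 0), ((0.0, 1.5), 0), ((1.0, 2.0), 0),
]
weights, bias = train_perceptron(examples)

# Final step: test on examples the model has never seen.
print(predict(weights, bias, (3.0, 1.0)), predict(weights, bias, (0.5, 3.0)))
```

Notice that the code never states a rule about what makes a triangle. The weights start at zero and change only when a guess turns out to be wrong.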
Classification is a supervised learning task where the goal is to predict a categorical or discrete output variable. The algorithm learns to assign input data to specific, predefined categories or classes. The output variable in classification is often referred to as the “target” or “label.”
Some common examples of classification problems include:
- Email spam detection: Given an email, classify it as either “spam” or “not spam.”
- Image recognition: Given an image, classify it as a specific object or entity (e.g., “dog,” “cat,” “car”).
- Sentiment analysis: Given a piece of text (e.g., a customer review), classify it as expressing a positive, negative, or neutral sentiment.
Sidebar – Popular Classification Algorithms
Without delving into a lot of detail here, there are many popular algorithms used for classification tasks, including:
- Decision Trees: Classifying data by splitting it into subsets based on feature values, creating a tree where each node represents a decision based on a feature.
- k-Nearest Neighbors (k-NN): Classifying data points based on the classes of their k-nearest neighbors in the feature space.
- Logistic Regression: Modeling the probability of a binary outcome based on one or more predictor variables using a logistic function.
- Naive Bayes: Classifying data based on Bayes’ theorem, assuming that features are independent given the class, and using probabilities to predict the class.
- Random Forests: Combining multiple decision trees to improve classification accuracy by pooling their predictions (for example, by majority vote), which reduces overfitting.
- Support Vector Machines (SVM): Finding the optimal boundary (hyperplane) that separates different classes by maximizing the margin between them.
These algorithms learn from the labeled training data to find patterns and decision boundaries that separate different classes. Once trained, they can predict the class or category of new, unseen input data.
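To show how one of these algorithms might look in practice, here is a minimal sketch of k-nearest neighbors applied to the spam example. The two features (number of links and number of exclamation marks in an email) and all of the data points are assumptions made up for illustration.

```python
import math
from collections import Counter


def knn_predict(training_data, new_point, k=3):
    # Sort the labeled examples by straight-line distance to the new point.
    by_distance = sorted(training_data,
                         key=lambda item: math.dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training set: ((links, exclamation marks), label).
training_data = [
    ((8, 6), "spam"), ((7, 9), "spam"), ((9, 7), "spam"),
    ((1, 0), "not spam"), ((0, 1), "not spam"), ((2, 1), "not spam"),
]

print(knn_predict(training_data, (6, 7)))
print(knn_predict(training_data, (1, 1)))
```

The "learning" in k-NN is simply remembering the labeled examples; the decision boundary emerges from which stored examples happen to be closest to a new email.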
Regression
Regression is a supervised learning task where the goal is to predict a continuous numerical value. Unlike classification, where the output is a discrete category, regression aims to estimate the relationship between input variables and a continuous target variable. In this context, input variables (also called features or predictors) are the characteristics or measurements that the model uses to make predictions, such as a house’s square footage, number of bedrooms, or neighborhood. The continuous target variable is the numerical outcome we want to predict, such as the house’s price or estimated value. Regression models learn how changes in the input variables correspond to changes in the target variable, allowing them to make predictions on new, unseen data points.
Some common examples of regression problems include:
- House price prediction: Given various features of a house (e.g., size, number of bedrooms, location), predict its price.
- Stock market forecasting: Given historical stock prices and other market data, predict a stock’s future price.
- Weather temperature prediction: Given historical weather data and current conditions, predict the temperature for the next day.
Sidebar – Popular Regression Algorithms
Popular algorithms used for regression tasks include:
- Lasso Regression: Performing linear regression with L1 regularization, which can shrink some coefficients to zero, effectively selecting features and reducing complexity.
- Linear Regression: Modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data.
- Polynomial Regression: Extending linear regression by modeling the relationship between the dependent and independent variables as an nth degree polynomial.
- Random Forests: Combining multiple decision trees to improve prediction accuracy by averaging their predictions and reducing overfitting.
- Ridge Regression: Performing linear regression with L2 regularization, which helps prevent overfitting by shrinking the regression coefficients.
- Support Vector Regression: Using support vector machines for regression tasks by finding a hyperplane that fits the data with the maximum number of points within a specified margin.
These algorithms learn from the labeled training data to find the best-fit relationship between the input variables and the continuous target variable. Once trained, the models can predict the numerical value of the target variable for new, unseen input data.
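As a small sketch of what "best-fit relationship" means, the code below fits a straight line to a handful of invented house sizes and prices using the standard least-squares formulas for a single input variable. Real house-price models use many more features and far more data; the numbers here are assumptions chosen to keep the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    # Least-squares fit of y = slope * x + intercept for one input variable.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: square footage and price (in thousands of dollars).
square_feet = [1000, 1500, 2000, 2500]
prices = [200, 275, 350, 425]

slope, intercept = fit_line(square_feet, prices)

# Predict the price of a new, unseen 1,800-square-foot house.
predicted_price = slope * 1800 + intercept
print(round(predicted_price, 1))
```

With this made-up data, each extra square foot adds about 0.15 (thousand dollars) to the predicted price, so the model estimates roughly 320 for the new house.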
In both classification and regression tasks, the labeled training data is crucial for the algorithm to learn the underlying patterns and relationships. The quality and quantity of the training data directly impact the performance and generalization ability of the supervised learning model.
After training, a supervised learning model is evaluated using a separate test dataset that the model hasn’t seen during training. This helps assess how well the model generalizes to new, unseen data and provides an estimate of its real-world performance.
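The sketch below illustrates the idea of holding back a test set. Everything in it is hypothetical: the data is a toy set of numbers labeled 1 when they are 10 or larger, and the "model" is just a cut-off value placed halfway between the averages of the two classes. What matters is the workflow of splitting, training on one part, and scoring on the other.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    # Shuffle a copy so the original order is left untouched.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training rows, test rows)


def fit_threshold(train_rows):
    # A deliberately simple "model": a cut-off halfway between the class averages.
    zeros = [x for x, y in train_rows if y == 0]
    ones = [x for x, y in train_rows if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, rows):
    correct = sum(1 for x, y in rows if (1 if x >= threshold else 0) == y)
    return correct / len(rows)


# Hypothetical labeled data: (value, label).
rows = [(x, 1 if x >= 10 else 0) for x in range(20)]

train_rows, test_rows = train_test_split(rows)
threshold = fit_threshold(train_rows)          # learn only from the training rows
test_accuracy = accuracy(threshold, test_rows)  # score only on the unseen rows
print(test_accuracy)
```

Because the test rows played no part in choosing the threshold, the accuracy measured on them is a fairer estimate of how the model would behave on genuinely new data.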
Unsupervised Learning

Unsupervised learning is a type of ML in which the algorithm learns from unlabeled data. Unlike supervised learning, where the data is labeled with the correct output, unsupervised learning algorithms work with input data that doesn’t have a corresponding output or target variable. The goal of unsupervised learning is to discover hidden patterns, relationships, or structures without specific guidance.
Imagine you’re given a stack of photographs with no labels or descriptions. Your task is to organize these photographs into meaningful groups based on their visual similarities. This is an example of unsupervised learning—you’re trying to find inherent patterns or clusters in the photographs (data) with no predefined categories or labels. Unsupervised learning algorithms can be broadly categorized into three main types: clustering, dimensionality reduction, and anomaly detection.
Clustering
Clustering is an unsupervised learning task where the goal is to group similar data points based on their inherent characteristics or features. The algorithm aims to partition the data into distinct clusters, where data points within the same cluster are more similar to each other than to data points in other clusters. Let’s go back to organizing our closet. Think of clustering as sorting a closet full of unlabeled clothes that are lying in one big heap on the floor. Without predefined categories, you naturally group similar items together, placing all shirts in one section, pants in another, and shoes on a separate rack. You didn’t need labels to create these groups; you recognized the natural similarities. Unlike supervised learning techniques that rely on labeled examples, clustering algorithms discover natural groupings in data without being told in advance what those groups should be. For example, an e-commerce company might use clustering to identify customer segments based on purchasing behavior, or a streaming service might group similar songs to create personalized playlists, all without predefined categories to work with. This ability to find structure without guidance is why clustering is a fundamental technique in unsupervised learning.
Some common examples of clustering applications include:
- Customer segmentation: Grouping customers based on their purchasing behavior, demographics, or preferences to tailor marketing strategies.
- Document clustering: Organizing a large collection of documents into topics or themes based on their content similarity.
- Image segmentation: Partitioning an image into multiple segments or regions based on pixel similarities, such as grouping areas with similar colors or textures to identify distinct objects or boundaries in the image.
Sidebar – Popular Clustering Algorithms
Popular algorithms used for clustering tasks include:
- K-means clustering: Grouping data points into a specified number of clusters based on their similarity, where each cluster is represented by the mean of the data points within it.
- Hierarchical clustering: Building a tree of clusters by either progressively merging smaller clusters into larger ones (agglomerative) or splitting larger clusters into smaller ones (divisive) based on their similarity.
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN): Clustering data points based on their density, identifying areas of high density as clusters and marking points in low-density areas as noise.
- Gaussian Mixture Models (GMM): Clustering data points by assuming that the data is generated from a mixture of several Gaussian distributions, each representing a different cluster, and using probabilities to assign data points to clusters.
These algorithms learn from the unlabeled data by identifying patterns and similarities among the data points, with no predefined categories or labels.
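Here is a minimal sketch of k-means on a tiny, invented customer dataset. The two features (store visits per month and average spend) and the starting centroids are assumptions; real implementations usually pick starting points more carefully and stop automatically once the groups stop changing.

```python
import math


def k_means(points, centroids, iterations=10):
    for _ in range(iterations):
        # Step 1: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Step 2: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data: (visits per month, average spend).
customers = [(1, 2), (2, 1), (1, 1), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]

# Start with two guesses for the group centers (here, two of the data points).
centroids, clusters = k_means(customers, centroids=[customers[0], customers[4]])
print(centroids)
```

No labels such as "occasional shopper" or "frequent shopper" were ever supplied. The two groups emerge purely from how close the points are to one another.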
Dimensionality Reduction
Dimensionality reduction is an unsupervised learning technique used to reduce the number of features or variables in a dataset while retaining the most important information. The goal is to transform high-dimensional data into a lower-dimensional space, making it easier to visualize, analyze, and process.
Some common applications of dimensionality reduction include:
- Data visualization: Reducing high-dimensional data to 2D or 3D representations for better visualization and understanding, such as transforming complex patient medical data with dozens of measurements into a simple scatter plot where doctors can see natural groupings of patient conditions.
- Feature extraction: Identifying the most informative features or variables in a dataset to improve model performance and efficiency, like distilling thousands of pixels in face recognition down to key facial measurements that capture the essence of each person’s unique appearance.
- Data compression: This involves compressing large datasets by removing redundant or less informative features, saving storage space and computational resources. It is similar to how image compression might preserve the important details of a picture while reducing its file size.
Sidebar – Popular Dimensionality Reduction Algorithms
Popular algorithms used for dimensionality reduction tasks include:
- Autoencoders: Neural networks designed to learn efficient representations of data by compressing input into a lower-dimensional space and then reconstructing it.
- Locally Linear Embedding (LLE): Reducing dimensionality by preserving local relationships between data points, using linear combinations of neighboring points to project data into a lower-dimensional space.
- Principal Component Analysis (PCA): Transforming data into a lower-dimensional space by finding the directions (principal components) that maximize variance and preserve the most information.
- t-Distributed Stochastic Neighbor Embedding (t-SNE): Visualizing high-dimensional data by converting similarities between data points into joint probabilities and minimizing the Kullback-Leibler divergence between the similarity distributions in the original and reduced-dimensional spaces.
These algorithms learn from unlabeled data to find the most important or informative features, allowing for a more compact and efficient representation of the data.
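To give a feel for how this works, the sketch below reduces two-dimensional points to a single number each by finding the direction along which the data varies the most, which is the core idea behind PCA. It uses a simple repeated-multiplication trick (power iteration) that only works as written for two features; libraries use more general and more robust methods. The data points are invented.

```python
import math


def first_principal_component(points, iterations=100):
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(2)]
    centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]
    # Entries of the 2x2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    # Power iteration: repeatedly multiply a vector by the covariance matrix
    # and rescale it; it settles on the direction of greatest variance.
    v = (1.0, 0.0)
    for _ in range(iterations):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return mean, v


def project(points, mean, direction):
    # Describe each 2-D point by one number: its position along the direction.
    return [(p[0] - mean[0]) * direction[0] + (p[1] - mean[1]) * direction[1]
            for p in points]


# Hypothetical data in which the two measurements rise and fall together.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

mean, direction = first_principal_component(points)
reduced = project(points, mean, direction)
print(direction, reduced)
```

Because the two measurements move together, one number per point captures almost all of the variation, which is why the second dimension can be dropped with little loss of information.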
Anomaly Detection
Anomaly detection, also known as outlier detection, is an unsupervised learning task that aims to identify rare, unusual, or suspicious data points that deviate significantly from the norm. The goal is to detect instances that don’t conform to the expected patterns or behaviors in the dataset.
Some common applications of anomaly detection include:
- Fraud detection: Identifying unusual financial transactions or behaviors that may indicate fraudulent activity.
- Network intrusion detection: Detecting suspicious network traffic or activities that may indicate a security breach or cyberattack.
- Equipment monitoring: Identifying unusual sensor readings or machine behavior that may indicate potential failures or malfunctions.
Sidebar – Popular Anomaly Detection Algorithms
Popular algorithms used for anomaly detection tasks include:
- Autoencoder-based Methods: Using neural networks to learn a compressed representation of data and detect anomalies by measuring the reconstruction error, with higher errors indicating potential anomalies.
- Isolation Forest: Detecting anomalies by randomly partitioning the data and isolating points that require fewer partitions, as anomalies are generally less frequent and more isolated.
- Local Outlier Factor: Identifying anomalies by comparing the local density of a data point to that of its neighbors, with lower-density points being considered outliers.
- One-Class SVM: Using support vector machines to model the distribution of normal data and identify anomalies as points that fall outside the learned region.
These algorithms learn from the unlabeled data to establish a baseline of normal behavior or patterns and then identify instances that deviate significantly from that baseline.
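The idea of a baseline and deviations from it can be sketched with a much simpler technique than the algorithms listed above: a z-score check, which flags values that sit unusually far from the average. This is a basic statistical stand-in and not one of the listed methods, and the transaction amounts and the cut-off of 2.0 are assumptions chosen for illustration.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    # Baseline of "normal": the average and the typical spread of the data.
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # every value is identical, so nothing stands out
    # Flag values that lie more than `threshold` spreads away from the average.
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical transaction amounts in dollars.
transactions = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]

print(find_anomalies(transactions))
```

With this data, only the 500-dollar transaction is flagged. One weakness of this simple approach is that an extreme value also drags the average and spread toward itself, which is part of why more robust methods such as Isolation Forest exist.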
Unsupervised learning is useful when processing large, complex datasets where manual labeling is impractical or expensive. By discovering hidden patterns and structures in the data, unsupervised learning algorithms can provide valuable insights, help in data exploration, and serve as a precursor to supervised learning tasks. However, evaluating the performance of unsupervised learning models can be more challenging than supervised learning, as there are no predefined correct answers or labels against which to compare. Often, domain expertise and qualitative assessment are required to interpret and validate the results of unsupervised learning algorithms.
Reinforcement Learning

Reinforcement learning (RL) is a type of ML where an agent (see sidebar) learns to make decisions by interacting with an environment. Unlike supervised learning, where the algorithm learns from labeled data, or unsupervised learning, where the algorithm learns from unlabeled data, RL focuses on learning through trial and error. The agent learns to take actions that maximize a reward signal.
Imagine teaching a robot to navigate a maze. The robot starts with no prior knowledge of the maze layout or the optimal path to reach the goal. As the robot explores the maze, it receives feedback as rewards or penalties. Positive rewards are given for actions that bring the robot closer to the goal, while penalties are given for actions that lead to dead ends or collisions. By learning from this feedback, the robot gradually discovers the best sequence of actions to navigate the maze successfully.
RL algorithms can be broadly categorized into three main types: Q-learning, policy gradients, and actor-critic methods.
Sidebar – What is an “Agent” in RL?
In RL, the term “agent” refers to an entity that interacts with an environment to achieve a goal. The agent makes decisions, takes actions, and learns from the outcomes of those actions to improve its performance over time. Here’s a breakdown of the key components and roles of an agent:
- Environment: The world in which the agent operates. It provides the context for the agent’s actions and returns feedback in the form of rewards and new states.
- States: Different situations or configurations in the environment that the agent can be in. Each state provides information that the agent uses to decide its next action.
- Actions: The possible moves or decisions the agent can make in a given state. Actions affect the state of the environment and lead to new states.
- Rewards: Feedback from the environment that tells the agent how good or bad its actions are. The agent’s goal is to maximize the total reward over time.
- Policy: The strategy or set of rules that the agent follows to decide which actions to take in different states. In policy gradient methods, the policy is directly optimized to improve performance.
- Value Function: A function that estimates the expected reward of being in a particular state or taking a specific action in that state. Q-learning focuses on learning the value function.
Example
Imagine a robot trying to navigate a maze to find a treasure. The robot is the agent, the maze is the environment, and each position in the maze is a state. The robot can take actions like moving north, south, east, or west. It receives rewards based on its progress toward the treasure. The robot’s policy guides its movements through the maze, and over time, it learns the best paths to maximize its reward.
By understanding the concept of an agent, we can better grasp how RL algorithms work to train these agents to make intelligent decisions.
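The pieces described in this sidebar can be laid out in code. The sketch below is a deliberately tiny, hypothetical "maze": a straight corridor with five positions and a treasure at the far end. The class name, the reward values, and the two policies are all our own inventions for illustration. Note that the agents here do not learn anything yet; the code only shows how states, actions, rewards, and a policy fit together.

```python
import random


class CorridorEnvironment:
    """A hypothetical one-dimensional maze: positions 0 to 4, treasure at 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Actions: -1 moves west, +1 moves east. Walls keep the robot inside.
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        reward = 10 if done else -1  # small penalty per move, bonus for the treasure
        return self.state, reward, done


def random_policy(state, rng):
    # A policy maps the current state to an action; this one ignores the state.
    return rng.choice([-1, +1])


def always_east_policy(state, rng):
    return +1


def run_episode(env, policy, rng, max_steps=100):
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state, rng)             # the agent decides
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward, state


rng = random.Random(0)
env = CorridorEnvironment()
print(run_episode(env, always_east_policy, rng))
print(run_episode(env, random_policy, rng))
```

The always-east policy collects a total reward of 7 (three moves at -1 each, then +10 for the treasure), while the random policy typically wanders and does worse. Reinforcement learning is about discovering a good policy automatically instead of writing it by hand.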
Q-Learning
https://youtube.com/watch?v=aLw43abG-NA
Q-learning is a powerful reinforcement learning algorithm that helps agents learn optimal behaviors through trial and error. The Mario Kart AI project in the video demonstrates this approach brilliantly.[5]
As explained by the researchers, the Mario Kart AI isn’t given explicit instructions on how to race. Instead, it learns by observing the screen and receiving rewards that indicate how well it’s performing.
Here’s how Q-learning works in the context of the Mario Kart AI:
- Observing the screen: The AI only sees what a human player would see – the game visuals.
- Receiving rewards: The system gets feedback on whether it’s doing well or poorly.
- Recognizing patterns: It learns which visual cues lead to which rewards.
- Making decisions: The AI selects actions it believes will maximize future rewards.
When training began, the AI had no idea how to drive, bumping into walls and falling off tracks. However, after just an hour of training, it started understanding basic patterns like staying on the road and avoiding obstacles.
The “Q” in Q-learning represents the quality of taking a specific action in a specific situation. The AI maintains a table (or in more complex implementations, a neural network) that estimates how valuable each action is in each state. This is represented mathematically as:
Q(s,a)←Q(s,a)+α[r+γmaxa′Q(s′,a′)−Q(s,a)]
Where:
- s is the current game screen (what the AI sees)
- a is the racing action taken (turn, accelerate, use item)
- r is the immediate reward (like +1 for hitting a checkpoint)
- s′ is the next game screen after taking the action
- max_{a′} Q(s′,a′) represents the estimated best future value from the new screen, taken over every possible next action a′
- α is how quickly the AI learns from new experiences (the learning rate)
- γ is how much the AI values future rewards versus immediate ones (the discount factor)
For example, when the AI approaches a sharp turn in Rainbow Road:
- It sees the current state (approaching the turn)
- It estimates the value of different actions (keep accelerating vs. drift)
- After taking an action, it observes the result (staying on track or falling off)
- It updates its Q-value based on the reward and the new situation
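To make these steps concrete, here is a minimal Python sketch of a single Q-value update for the sharp-turn example. The numbers are made up for illustration and do not come from the Mario Kart project: we assume the AI currently rates "drift" at 2.0, earns a reward of 1.0 for staying on the track, and sees a best value of 5.0 on the next screen, with α = 0.1 and γ = 0.9.

```python
def q_update(q_old, reward, best_next_q, alpha=0.1, gamma=0.9):
    """Return the new Q(s, a) after one step of the Q-learning update."""
    target = reward + gamma * best_next_q      # r + gamma * max over a' of Q(s', a')
    return q_old + alpha * (target - q_old)    # move a fraction alpha toward the target


# Hypothetical numbers for the "drift" action at the sharp turn
new_q = q_update(q_old=2.0, reward=1.0, best_next_q=5.0)
print(round(new_q, 2))  # 2.35
```

The estimate moves only a small step (from 2.0 to 2.35) toward the target of 5.5. A larger α would make the AI trust each new experience more and its older estimates less.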
As seen in the video, after three weeks of training, the AI went from randomly crashing to executing advanced strategies. It learned to:
- Navigate complex tracks like Rainbow Road
- Use mushrooms strategically to clear ink from bloopers
- Attempt challenging shortcuts for potential time advantages
This progression from random movements to sophisticated racing illustrates the power of Q-learning. Through millions of iterations of this formula, the AI develops an increasingly accurate estimate of which actions lead to the highest rewards in every possible game situation.
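The idea of repeating the update many times can be sketched in a few lines of Python. This is our own toy example, not the Mario Kart system: a robot walks along a corridor with five positions (a one-dimensional cousin of the maze described earlier), and the treasure sits at the far right. The corridor, the reward of +1, and the settings for α, γ, and the exploration rate are all assumptions chosen for illustration.

```python
import random

N_STATES = 5          # corridor positions 0..4; the treasure sits at position 4
ACTIONS = [-1, +1]    # index 0 = step left, index 1 = step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2


def step(state, action_index):
    """Toy environment: move along the corridor; +1 reward for reaching the treasure."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def choose_action(q_row):
    """Explore at random some of the time (and on ties); otherwise take the best-known action."""
    if random.random() < EPSILON or q_row[0] == q_row[1]:
        return random.randrange(len(ACTIONS))
    return 0 if q_row[0] > q_row[1] else 1


def train(episodes=500, seed=0):
    random.seed(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # the Q-table: one row per position
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            a = choose_action(q[state])
            next_state, reward = step(state, a)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max Q(s',a') - Q(s,a)]
            q[state][a] += ALPHA * (reward + GAMMA * max(q[next_state]) - q[state][a])
            state = next_state
    return q


q_table = train()
greedy_policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(greedy_policy)
```

At the start, the table is all zeros and the robot wanders at random. With these settings, the learned table ends up favoring a step to the right from every position, because the value of the treasure gradually "leaks" backward along the corridor, one update at a time.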
Policy Gradients
Policy gradient methods are a type of reinforcement learning algorithm that directly improves how an agent makes decisions (its “policy”). While Q-learning focuses on estimating the value of each state-action pair, policy gradients directly adjust the decision-making strategy to maximize rewards.
Think about the Mario Kart AI in the video. Instead of calculating the exact value of every possible driving action in every situation, policy gradient methods would focus on improving the overall racing strategy directly:
- Policy as a Racing Strategy: The policy is the AI’s racing strategy – its rules for how to drive in each situation. It calculates the probability of taking each action (accelerate, drift left/right, use item) based on what it sees on screen.
- Racing and Collecting Data: The AI races using its current strategy and records what happened: what it saw, what actions it took, and what rewards it received (like checkpoint bonuses or penalties for hitting walls).
- Finding What Worked: After several races, the AI analyzes which driving choices led to better outcomes. For example, it might notice that drifting on corners led to staying on the track and faster lap times.
- Adjusting the Strategy: The AI then updates its racing strategy to increase the probability of making good choices that led to higher rewards. In the video, we saw how the AI gradually learned to drive more effectively, developing complex behaviors like using mushrooms strategically.
The key formula for policy gradients is:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
In Mario Kart terms:
- θ represents the AI’s current racing strategy parameters
- α is how quickly the AI adjusts its strategy (learning rate)
- ∇θJ(θ) represents the direction of change that would improve race performance
For example, if the AI noticed that using a mushroom to clear away blooper ink (as mentioned in the video) consistently led to better race positions, it would adjust its strategy to increase the probability of saving mushrooms for this purpose.
The beauty of policy gradients is that they work well for complex scenarios with continuous actions or where the best strategy isn’t obvious. As we saw in the video, after millions of racing attempts, the AI evolved from randomly crashing to executing sophisticated racing strategies – not by calculating the exact value of each possible move, but by directly improving its overall racing approach based on what actually worked on the track.
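Here is a minimal Python sketch of a policy gradient update, in the style of a classic method often called REINFORCE. It is a heavily simplified stand-in for the racing AI: there is only one situation (a corner) and two actions, and we assume drifting always earns +1 while braking always earns -1. Those rewards, the learning rate, and the names in the code are our assumptions for illustration.

```python
import math
import random

ACTIONS = ["drift", "brake"]
REWARDS = {"drift": 1.0, "brake": -1.0}   # hypothetical rewards for one corner


def softmax(prefs):
    """Turn action preferences into probabilities that sum to 1."""
    exps = [math.exp(p - max(prefs)) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=200, alpha=0.1, seed=0):
    random.seed(seed)
    theta = [0.0, 0.0]                     # policy parameters: one preference per action
    for _ in range(steps):
        probs = softmax(theta)
        a = random.choices(range(len(ACTIONS)), weights=probs)[0]   # act using the current policy
        reward = REWARDS[ACTIONS[a]]
        for k in range(len(theta)):
            # Gradient of log(probability of the chosen action) for a softmax policy
            grad_log_prob = (1.0 if k == a else 0.0) - probs[k]
            # theta <- theta + alpha * (estimate of the gradient of J)
            theta[k] += alpha * reward * grad_log_prob
    return softmax(theta)


final_probs = train()
print({name: round(p, 3) for name, p in zip(ACTIONS, final_probs)})
```

Notice that the code never estimates how valuable each action is. It simply nudges the probabilities: actions followed by good rewards become more likely, and actions followed by bad rewards become less likely.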
Actor-Critic Methods
Actor-critic methods combine the strengths of value-based and policy-based RL approaches. They comprise two main components: an actor and a critic. The actor represents the policy that determines the actions to take in a given state, while the critic estimates the value function to evaluate the quality of the actions taken by the actor.
The key idea behind actor-critic methods is to use the critic’s value estimates to guide the actor’s policy updates. The critic learns to estimate the value function using techniques like value-based methods (e.g., Q-learning or temporal difference learning), while the actor learns to optimize the policy using policy gradient methods.
The actor-critic algorithm works as follows:
- The actor generates actions based on the current policy.
- The critic evaluates the actions taken by the actor and provides feedback in the form of value estimates.
- The actor uses the critic’s feedback to update its policy parameters using policy gradient methods.
- The critic updates its value estimates based on the observed rewards and the estimated future values.
By iteratively updating both the actor and the critic, the agent learns to make better decisions and improve its overall performance. Actor-critic methods have been successful in various domains, including robotics, game playing, and autonomous driving.
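The four steps above can be sketched in Python as a simple one-step actor-critic loop. As before, this is a toy under our own assumptions rather than a production algorithm: the environment is a five-position corridor with a treasure at the right end, the actor is a table of action preferences, and the critic is a table of value estimates. The learning rates and the cap on steps per episode are arbitrary choices.

```python
import math
import random

N_STATES, GOAL = 5, 4
GAMMA, ALPHA_ACTOR, ALPHA_CRITIC = 0.9, 0.2, 0.2


def softmax(prefs):
    exps = [math.exp(p - max(prefs)) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Toy corridor: action 0 = left, 1 = right; +1 reward for reaching the goal."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    return next_state, (1.0 if next_state == GOAL else 0.0)


def train(episodes=300, max_steps=200, seed=0):
    random.seed(seed)
    prefs = [[0.0, 0.0] for _ in range(N_STATES)]   # actor: action preferences per position
    values = [0.0] * N_STATES                        # critic: value estimate per position
    for _episode in range(episodes):
        state = 0
        for _t in range(max_steps):
            if state == GOAL:
                break
            probs = softmax(prefs[state])
            action = random.choices([0, 1], weights=probs)[0]       # 1. the actor picks an action
            next_state, reward = step(state, action)
            # 2. the critic's feedback: was the outcome better or worse than expected?
            td_error = reward + GAMMA * values[next_state] - values[state]
            # 3. the actor shifts its policy in the direction the critic recommends
            for k in (0, 1):
                grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
                prefs[state][k] += ALPHA_ACTOR * td_error * grad_log_prob
            # 4. the critic updates its own estimate from the reward and the next value
            values[state] += ALPHA_CRITIC * td_error
            state = next_state
    return prefs, values


prefs, values = train()
print([round(softmax(p)[1], 2) for p in prefs[:GOAL]])   # probability of stepping right
print([round(v, 2) for v in values])
```

The quantity `td_error` is what ties the two parts together. When an action turns out better than the critic expected, the error is positive and the actor makes that action more likely; when it turns out worse, the error is negative and the action becomes less likely.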
Think of the actor-critic method in RL as a director and a film critic working together on a movie. The director (the “actor” in RL terms) decides how to shoot each scene, telling the cast what to do and how to perform according to the script. The film critic (the critic), meanwhile, watches the scenes and reviews them, offering feedback on how well each scene contributes to the overall movie.
Let’s break down this metaphor.
- The actor (the director in our movie) represents the “policy,” deciding the actions (or scenes) based on the current situation (or script), much as a coach decides which plays to run in a soccer game.
- The critic assesses the value of the actions taken by the actor, like a critic reviewing how well a scene works in a movie. This evaluation helps determine how much the scene contributes to the movie’s success.
Here’s how they collaborate using the actor-critic method:
- Action: The director (the actor) shoots a scene based on the current version of the movie script (the policy).
- Evaluation: The critic watches the scene and decides how good it is—whether it adds tension, develops the character, or moves the plot effectively. This evaluation is akin to assessing the “value” of the scene in terms of its contribution to the film’s success.
- Feedback and Improvement: The critic’s review informs the director how to tweak the script or direct the scene differently. This feedback helps refine the policy (script) to ensure future scenes are better.
- Update and Repeat: The critic also updates its understanding of what makes a good scene, refining its evaluation criteria based on the overall progress of the movie and the feedback from the audience (the rewards).
By continuously updating the policy (direction) based on evaluations (reviews), and refining the evaluations themselves, the director and critic work together to improve the movie, scene by scene. This dual process helps in making more nuanced decisions, much like how actor-critic methods allow for sophisticated learning and decision-making in AI, enhancing performance in complex environments like robotics and game playing.
Transfer Learning

Transfer learning is an ML technique that focuses on leveraging knowledge gained from solving one problem and applying it to a different but related problem. The idea behind transfer learning is to use the knowledge, features, or representations learned from a source task to improve performance or speed up the learning process on a target task.
Long before I (Jason) ever learned to type, I took piano lessons. I remember my piano teacher drilling into me how to sit, how to hold my hands, where to place my feet, and many other technical aspects of playing the piano. Years later, when I really learned how to type, a lot of those same muscle memory skills came flooding back, even though I hadn’t played the piano in many years at that point. Although the instruments are different, many of the skills and behaviors I learned from piano lessons (e.g., how to sit, how to play without watching my hands, good posture, etc.) were transferable to learning how to type. This is the essence of transfer learning–using the knowledge gained from one domain to solve problems in another related domain.
Transfer learning is useful when the target task has limited labeled data, as it allows the model to benefit from the rich knowledge gained from the source task. By transferring knowledge, transfer learning can reduce the amount of labeled data required, accelerate the learning process, and improve the model’s performance on the target task.
Two common forms of transfer learning are domain adaptation and multi-task learning.
Domain Adaptation
Domain adaptation is a type of transfer learning that focuses on adapting a model trained on a source domain to perform well on a target domain. The source and target domains share the same task but have different data distributions or feature spaces.
For example, suppose you trained a sentiment analysis model on product reviews from an e-commerce website. Sentiment analysis uses technology to understand the emotional tone of text, identifying whether it’s positive, negative, or neutral. This is helpful for businesses gauging customer satisfaction or for social media platforms identifying potentially harmful content. If you want to apply this model to analyze sentiment in social media posts, you would face a domain adaptation problem. The language, vocabulary, and writing style used in social media posts may differ from those in product reviews, even though the task of sentiment analysis remains the same.
Imagine you’ve trained a high school football team to play on a grass field. Now, you’re asked to have them play on AstroTurf in another city. This new playing field has its quirks—the artificial turf provides more consistent footing but also creates faster play, different ball bounces, and even the risk of painful turf burns when players slide or fall. This change in environment is like moving from one “domain” to another in the world of AI. The game of football is the same, but the conditions have changed significantly. This situation is what we call “domain adaptation” in AI.
Take the example of a computer program that’s been taught to understand people’s opinions about products based on reviews on a shopping website. Now, suppose you want to use this same program to understand what people are saying on X. Posts on X are much shorter (280 characters for non-subscribers) than product reviews, use more slang, and include emojis that weren’t in the original data the program was trained on. This shift from product reviews to tweets is like moving the football team from natural grass to artificial turf. Although the task (understanding opinions) remains the same, the environment and the style of expression have changed.
To help the program adapt, here are some techniques used in domain adaptation:
- Feature alignment: This is like adjusting your team’s play style to account for the faster speed and more consistent ball bounces on artificial turf. For the AI, it means tweaking the data characteristics from the shopping reviews and tweets so they align more closely, making it easier for the program to understand both.
- Instance weighting: This is akin to giving more practice time to players who already perform well on turf fields or focusing on plays that work better on the faster surface. Similarly, the AI gives more importance to parts of the data from the reviews that are more like tweets.
- Adversarial learning: Think of this as having a scrimmage where half your team plays with turf-specific strategies while the other half tries to counter them. In AI, adversarial learning involves a sort of friendly battle within the program. One part of the program produces representations of the text that could plausibly belong to either the reviews or the tweets, while another part tries to determine which source each one really came from. This back-and-forth pushes the program to rely on features that work equally well for both styles of data.
By employing these techniques, the program, like the football team, gets better at performing in the new environment (like X), leveraging what it learned in the old environment (like the shopping site). This process enhances the program’s ability to understand and analyze data across different platforms effectively.
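To give a feel for feature alignment, here is a deliberately simple Python sketch. It rescales one feature (text length) separately in each domain so that reviews and posts end up on the same scale. Real domain adaptation methods are far more sophisticated; the feature, the numbers, and the `standardize` helper are all our own assumptions for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale one domain's feature values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical feature: text length in characters
review_lengths = [420, 510, 630, 480]   # source domain: product reviews
post_lengths = [60, 140, 220, 100]      # target domain: short social media posts

aligned_reviews = standardize(review_lengths)
aligned_posts = standardize(post_lengths)

print([round(v, 2) for v in aligned_reviews])
print([round(v, 2) for v in aligned_posts])
```

Before alignment, every review looks "long" next to every post, so a model trained on reviews would see posts as strange outliers. After alignment, "long for a review" and "long for a post" land in the same range, which makes what the model learned on one platform easier to reuse on the other.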
Multi-task Learning
Multi-task learning is a type of transfer learning that involves training a model to solve multiple related tasks simultaneously. The goal is to leverage the shared knowledge and representations learned across different tasks to improve the model’s performance on each individual task.
For example, consider a model designed to perform both object detection and image segmentation. These tasks are related, as they both involve understanding the content and structure of images. By training the model on both tasks simultaneously, the model can learn shared features and representations that are beneficial for both object detection and image segmentation. The model typically consists of a shared architecture with task-specific layers or branches. The shared layers learn common features and representations that are relevant to all tasks, while the task-specific layers specialize in solving each individual task.
Imagine you’re a schoolteacher preparing lessons for three different subjects: math, science, and history. Instead of teaching these subjects separately, you find common themes and skills across all three and integrate them into your lessons. For example, teaching statistics in math can help with understanding data in science experiments, and historical events can provide context for scientific discoveries. This integrated approach is like what we call “multi-task learning” in the world of AI.
In multi-task learning, an AI model is trained to handle multiple tasks, such as recognizing objects in pictures and understanding what those objects are doing—both tasks involve processing visual information. By tackling these tasks together, the model learns to find and use common features and insights that apply to both tasks, which can improve its overall ability to understand images.
Here’s how it works:
- Shared architecture: This is like the general lessons in our classroom example, where the basics of math, science, and history are taught. In AI, these are common layers in the model that learn features that are useful across all tasks, like recognizing shapes and textures in images.
- Task-specific layers: These are like the specialized parts of the lesson that focus on the unique aspects of each subject. In AI, these layers are tailored to handle the specifics of each task, like distinguishing between different objects (object detection) or outlining them precisely (image segmentation).
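The shared-plus-specific structure can be sketched in a few lines of Python. This is a toy forward pass, not a trained model: the three input features, the layer sizes, and every weight are made-up numbers, and a real system would learn them from data.

```python
def dense(inputs, weights, biases):
    """One layer: each output is a weighted sum of the inputs plus a bias, floored at 0 (ReLU)."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


# Shared layer: used by every task (hypothetical weights)
SHARED_W, SHARED_B = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]
# Task-specific layers: one small "head" per task (hypothetical weights)
DETECT_W, DETECT_B = [[1.0, 1.0]], [0.0]
SEGMENT_W, SEGMENT_B = [[0.2, 0.4], [-0.3, 0.9]], [0.0, 0.0]


def multi_task_forward(image_features):
    shared = dense(image_features, SHARED_W, SHARED_B)        # computed once
    return {
        "detection": dense(shared, DETECT_W, DETECT_B),       # task 1 reuses the shared features
        "segmentation": dense(shared, SEGMENT_W, SEGMENT_B),  # task 2 reuses them too
    }


outputs = multi_task_forward([1.0, 2.0, 3.0])
print(outputs)
```

The important detail is that `shared` is computed once and fed to both heads. During training, mistakes on either task would adjust the shared weights, which is how the two tasks end up helping each other.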
The benefits of this multi-task learning approach include:
- Improved generalization: Just as students can apply their knowledge of statistics across different subjects, the AI model develops a more flexible understanding of images that helps it perform better and avoid mistakes specific to one task.
- Knowledge transfer: Insights gained from one task can enhance performance on others, like how understanding historical timelines can help students grasp scientific progressions.
- Reduced computational cost: Teaching three subjects together saves time and resources, much like training one AI model to do multiple tasks is more efficient than training several separate models.
However, multi-task learning also faces challenges:
- Balancing different needs: Like ensuring each subject in school gets enough attention, in AI, we must manage how much the model focuses on each task to avoid neglecting any.
- Dealing with noise or biases: Just as different subjects might have conflicting information or perspectives, tasks in AI might introduce specific errors or biases that need to be managed carefully.
- Designing the right model architecture: It’s crucial to structure the AI model in a way that supports learning across all tasks without one interfering with another, like planning a school curriculum that harmoniously blends different subjects.
Through multi-task learning, an AI model, like a well-rounded student, becomes adept at handling diverse challenges, using shared knowledge to enhance learning and performance across all tasks.
Gemini Summarizes ML
Okay, this section was quite technical (sorry!). To help with this, let’s see if we can get Gemini to help.
Hey Gemini! I just wrote a section on Machine Learning for a new textbook. Unfortunately, it’s complicated and full of a lot of technical terms and ideas that could become easily overwhelming. I’m going to provide the text of this section. Can you provide a clear summary of the information and target it toward someone with no background in AI. Heck, let’s target it toward someone who is in the 6th grade.
Absolutely! Let’s break down this section on Machine Learning so that a 6th grader can understand it. Imagine we’re explaining this to your younger sibling or a friend who loves video games and learning new things. Here’s a simplified summary:
Machine Learning: Teaching Computers to Learn Like You!
Have you ever taught a dog to do a trick? Or learned to ride a bike? That’s kind of what “Machine Learning” is all about, but for computers! Instead of teaching a dog or yourself, we’re teaching computers to learn and get better at things, all on their own.
Think of it like this: Imagine you have a super smart puppy (that’s our computer!). You want to teach it to recognize cats and dogs.
That’s Machine Learning in a nutshell! It’s about giving computers lots of examples and letting them learn patterns so they can make smart guesses or decisions on their own, without us having to tell them exactly what to do every single time.
Let’s get a little more specific, like leveling up in a video game!
1. Machine Learning is a Part of Something Bigger
Think of AI as making computers act smart, like in movies where robots can talk and think. Machine Learning is one way to make AI happen. It’s like saying “basketball” is a type of “sport.” Machine Learning is a type of AI.
2. Computers Get Better by Using Data (Information)
Data is just a fancy word for information! Like those pictures of cats and dogs. The more information (data) you give a computer, the better it gets at learning. It’s like practicing a video game – the more you play, the better you get!
3. “Algorithms” are Like Secret Recipes for Learning
Computers use things called “algorithms” to learn. Think of an algorithm as a secret recipe or a set of instructions for the computer to follow to find patterns in the data. These recipes help the computer find clues and learn from them.
4. Machine Learning Can Do 3 Cool Things (Describe, Predict, Prescribe):
5. Less Human Help Needed!
Once you’ve taught a computer using machine learning, it can often work on its own without needing a person to tell it what to do every step of the way. Think of a self-driving car! It learns how to drive by seeing tons of examples and then can drive itself without a person constantly steering.
Different Ways Computers Learn (Like Different Classes at School!)
There are different types of machine learning, just like there are different classes you take at school! Here are a few big ones:
A. Supervised Learning: Learning with Labels (Like a Teacher Giving You Answers)
B. Unsupervised Learning: Learning Without Labels (Like Exploring on Your Own)
C. Reinforcement Learning: Learning by Trying and Getting Rewards (Like Training a Robot in a Maze)
That’s Machine Learning in a nutshell for a 6th grader! It’s all about teaching computers to learn from examples, find patterns, and make smart decisions, just like we do, but often with HUGE amounts of information. It’s used for tons of cool stuff like recommending videos you might like, helping doctors diagnose diseases, and even making self-driving cars work! Let me know if you’d like me to explain any part of this in even simpler terms! 😊
My thinking process for summarizing the provided Machine Learning text for a 6th grader went something like this:
Key Takeaways
- ML is a subfield of AI that enables computers to learn from data and improve their performance without explicit programming, using algorithms to detect patterns and make decisions.
- Supervised learning uses labeled data to train algorithms, unsupervised learning finds patterns in unlabeled data, and RL involves learning through trial and error.
- Algorithms are essential in ML for analyzing data, identifying patterns, and making predictions or decisions.
- ML offers significant advantages like automation and improved decision-making, but it also faces challenges such as data quality, computational requirements, and ethical concerns.
Exercises
- Analyze case studies of ML applications in different industries, identifying the type of learning used and its impact.
- Design a simple ML algorithm for a specific task, such as predicting housing prices based on historical data.
- Discuss the ethical implications of ML, focusing on issues like bias, privacy, and transparency.
Neural Networks, Deep Learning, & Artificial Neural Networks
Learning Objectives
- Understand the purpose and process of neural networks in AI.
- Describe the key concepts and architectures in deep learning.
- Explain how artificial neurons are structured and how they process information in an artificial neural network (ANN).
- Discuss the role of activation functions, loss functions, and optimization algorithms in training ANNs.
- Apply the concepts of deep learning and ANNs to real-world problems and scenarios.
Today, machines not only crunch numbers but also see, listen, talk, and even dream up artwork. Machines can perform these new AI functions because of deep learning and Artificial Neural Networks (ANNs). By attempting to mimic the human brain, these technologies allow machines to learn from information and tackle tasks once considered uniquely human. Technically, both deep learning and ANNs fall under the larger umbrella of ML. However, separating the sections on Deep Learning and ANNs from ML allows us to engage in a more focused exploration of these advanced subfields, which require a deeper understanding of specialized concepts. ML has been around for decades, while deep learning and ANNs represent more recent advancements, driven by increases in computational power and the availability of large datasets. By building a strong foundational knowledge of ML first, we can better appreciate the complexity and transformative impact of deep learning and ANNs.
Neural Networks
Have you ever wondered how computers can recognize faces in photos or how your smartphone understands your voice commands? The answer lies in a technology called neural networks. In this section, we’ll explore what neural networks are, why they’re so powerful, and how they’re changing the world.
What are Neural Networks?
Imagine trying to teach a computer to recognize different types of fruit. You could write a long list of rules: “If it’s round and red, it might be an apple. If it’s yellow and curved, it could be a banana.” But what about a green apple, a cherry, or a very ripe banana? Writing rules for every possibility would be overwhelming, if not impossible.
This is where neural networks come in. They’re a way to help computers learn and make decisions more like humans do. Instead of following a strict set of rules, neural networks learn from examples. A neural network is like a digital brain. It’s made up of interconnected “neurons” that work together to process information and make decisions.
Why are Neural Networks Special?
Neural networks have some incredible advantages over traditional computing methods. Let’s explore these advantages using our fruit classification example to see why neural networks are so powerful and versatile.
They Can Handle Complexity
Neural networks excel at tasks that are easy for humans but hard to describe with rules, like recognizing objects in images or understanding spoken language. Imagine trying to write a traditional computer program to identify different fruit. You might start with rules like:
- If the object is round and red, it’s probably an apple.
- If it’s yellow and curved, it’s likely a banana.
- If it’s small, round, and purple, it could be a grape.
But what about a green apple? Or a slightly unripe banana that’s still greenish? What about the countless variations in shape, size, and color that fruits can have? Writing rules for every possibility would be extremely complex and time-consuming.
A neural network can learn to recognize fruits in all their variations simply by being shown many examples. It can handle the complexity of real-world fruit appearances without needing explicit rules for every possibility.
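To see what “learning from examples” looks like, here is a minimal Python sketch of a single artificial neuron (often called a perceptron), the simplest building block of a neural network. It is not a full network, and the two features (redness and elongation on a 0 to 10 scale) and the four example fruits are invented for illustration.

```python
def predict(weights, bias, features):
    """Answer 1 (banana) if the weighted sum of the features is positive, else 0 (apple)."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=10, learning_rate=1.0):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)   # 0 when the guess was right
            # Nudge the weights toward the correct answer only when the guess was wrong
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical examples: [redness, elongation], label 1 = banana, 0 = apple
examples = [
    ([9.0, 1.0], 0),   # red apple
    ([2.0, 2.0], 0),   # green apple
    ([1.0, 9.0], 1),   # yellow banana
    ([3.0, 8.0], 1),   # banana with brown spots
]

weights, bias = train_perceptron(examples)
print(weights, bias)                          # [-8.0, 8.0] 0.0
print(predict(weights, bias, [2.0, 9.0]))     # 1: a greenish banana it has never seen
```

Nobody wrote a rule saying “green things can still be apples.” The neuron worked out from the examples that elongation matters more than redness, which is why it handles the green apple and the greenish banana correctly.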
They Improve with Experience
Generally, the more data a neural network processes, the better it gets at its job. It’s like how you get better at a skill the more you practice. Let’s say we begin training our neural network with 100 images each of apples, bananas, and oranges. Our computer might do a decent job at differentiating between the three types of fruit, but it could struggle with less common varieties (there are about 7,500 apple varieties in the world) or fruits photographed from unusual angles.
Now, imagine we feed it 10,000 images of each fruit, including many varieties and angles. Our network will probably become much more accurate. It might learn to recognize that an apple can be red, green, or yellow, and that bananas can be anywhere from green to brown depending on ripeness.
The more examples it sees, the better it gets at identifying the key features that distinguish fruit, regardless of minor variations.
They Can Find Hidden Patterns
Neural networks can discover relationships in data that humans might miss, making them valuable in fields like scientific research and financial analysis. While training our fruit classification network, it might discover patterns that aren’t obvious to humans. For instance:
- It might notice that apples have a certain texture in images that humans don’t consciously register.
- It could pick up on subtle color gradients that are characteristic of ripening bananas.
- It might learn to use tiny details like the pattern of the fruit’s skin or the shape of the stem to aid in classification.
These hidden patterns enable the network to make accurate predictions, even in complex cases where traditional rule-based systems might fail. Sometimes, the patterns that neural networks detect are not immediately explainable by humans. As neural networks grow more sophisticated, it becomes increasingly challenging for humans to understand the decision-making process, a phenomenon known as the “black box problem.” This challenge, often referred to as the “explainability problem,” is a critical issue in AI and ML. It underscores the difficulty in interpreting how these models arrive at specific conclusions, which can be particularly problematic in high-stakes applications like healthcare or finance. We’ll delve deeper into the implications and potential solutions for this problem in Chapter 4.
They’re Versatile
The same basic neural network design can be adapted to solve many types of problems, from playing chess to predicting weather patterns. The neural network we’ve designed for fruit classification could be easily adapted for other tasks with minimal changes to its architecture. For example:
- Vegetable Classification: We could use the same network structure but train it on images of vegetables instead of fruits.
- Ripeness Detection: With some tweaks, we could adapt the network to determine if a fruit is ripe or not.
- Fruit Disease Identification: The same basic structure could identify diseases in fruits by training it on images of healthy and diseased fruits.
- Yield Prediction: With data, a similar network could predict fruit yields based on images of fruit trees or orchards.
This versatility means that once you understand how to work with neural networks, you can apply this knowledge to a wide range of problems across many fields.
Real-World Applications
Neural networks aren’t just a cool idea—they’re already changing the world. Here are a few examples:
- Image Recognition: When your phone automatically tags your friends in photos, it’s using a neural network.
- Voice Assistants: Siri, Alexa, and Google Assistant use neural networks to understand and respond to your voice commands.
- Medical Diagnosis: Neural networks are helping doctors detect diseases earlier and more accurately by analyzing medical images and patient data.
- Self-Driving Cars: Neural networks help autonomous vehicles understand their environment and make driving decisions.
- Language Translation: Online translation tools use neural networks to provide more accurate and natural-sounding translations.
The Architecture of Neural Networks
Now that we understand neural networks and why they’re important, let’s peek under the hood and see how they’re built. A neural network’s structure is often called its “architecture,” and understanding this architecture is key to grasping how these digital brains work.
To help visualize the structure of a neural network, imagine a multi-level fruit stand at a bustling market. This fruit stand has three main levels, each playing a crucial role in identifying and sorting fruit. Let’s explore each level.
The Input Layer: The Bottom Shelf
Think of the bottom shelf of our fruit stand as the input layer. This is where we place our fruit to be identified. In a neural network, this layer receives the initial data. Each spot on this shelf represents a feature of our data.
For our fruit example, these features might be:
- Color (Is it red? Yellow? Green?)
- Shape (Is it round? Elongated?)
- Size (Is it small like a berry or large like a melon?)
- Texture (Is the skin smooth or rough?)
In a real neural network dealing with images, each “spot” might represent a pixel in the image.
The Hidden Layers: The Middle Shelves
The middle shelves of our fruit stand represent the hidden layers. This is where the magic happens! In our analogy, workers on these levels examine the fruit from the bottom shelf and pass information up.
These hidden layers process the information from the previous layer and pass it on to the next. A neural network can have many hidden layers—the more layers, the “deeper” the network. This is where the term “deep learning” comes from. We’ll explore this more later in this chapter.
In our fruit stand, a worker might combine information like “yellow” and “curved” to recognize the shape of a banana. The next level might combine this banana shape with size information to distinguish between a regular banana and a plantain.
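Here is a small Python sketch of the two fruit stand workers just described. Each worker is one artificial neuron: it multiplies each piece of incoming information by an importance number, adds them up, and passes the result along only if it is above zero. All of the numbers below are hand-picked by us to make the example work; a real network would learn them during training.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias; anything below zero is passed on as 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0.0, total)


def classify(yellow, curved, size):
    # First hidden level: combine "yellow" and "curved" into a banana-shape signal
    banana_shape = neuron([yellow, curved], weights=[1.0, 1.0], bias=-1.0)
    # Next level: combine the banana shape with size (0 = tiny, 1 = large)
    banana = neuron([banana_shape, size], weights=[2.0, -2.0], bias=0.0)
    plantain = neuron([banana_shape, size], weights=[2.0, 2.0], bias=-3.0)
    if banana == 0.0 and plantain == 0.0:
        return "something else"
    return "banana" if banana > plantain else "plantain"


print(classify(yellow=1.0, curved=1.0, size=0.25))   # banana
print(classify(yellow=1.0, curved=1.0, size=1.0))    # plantain
print(classify(yellow=0.0, curved=0.0, size=0.5))    # something else
```

Notice that the second level never looks at color or curvature directly. It only sees the banana-shape signal produced by the level below, which is exactly how information is passed up through the shelves of our fruit stand.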
Understanding Parameters in Neural Networks
Each connection between neurons in our network has a weight, and each neuron has a bias. These weights and biases are what we call the “parameters” of the network. They’re like the tools and knowledge our fruit stand workers use to make their decisions. In our fruit stand analogy, weights are like the importance each worker assigns to different fruit characteristics, and biases are like each worker’s personal preferences or tendencies.
The total number of parameters in a network is the sum of all these weights and biases. This number can get very large, very quickly!
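To see how quickly the count grows, here is a short Python sketch that adds up the weights and biases. It assumes the simplest kind of network, where every neuron in one layer connects to every neuron in the next and every neuron after the input layer has one bias. The layer sizes are our own examples.

```python
def count_parameters(layer_sizes):
    """Count the weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out   # one weight per connection between the two layers
        total += n_out          # one bias per neuron in the receiving layer
    return total


# A tiny fruit network: 4 input features, 5 hidden neurons, 3 output fruits
print(count_parameters([4, 5, 3]))        # 43
# A small image network: 784 pixels, 128 hidden neurons, 10 outputs
print(count_parameters([784, 128, 10]))   # 101770
```

Moving from 4 inputs to a small 784-pixel image pushes the count from 43 parameters to more than 100,000, and that is still a very small network by modern standards.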
Parameters and Large Language Models (LLMs)
Now, let’s scale up our fruit stand to something much, much bigger. Imagine instead of identifying fruits, we’re trying to understand and generate human language. This is what LLMs do.
While we were writing this section, Meta released its latest LLM (Llama 3.1), which comes in three sizes: 8 billion, 70 billion, and 405 billion parameters.
- 8 billion parameters: This is like having a fruit stand with thousands of levels and millions of workers, each looking at tiny, subtle details of the “fruit” (in this case, pieces of text).
- 70 billion parameters: Even more levels, even more workers. This network can understand and generate more nuanced and complex language patterns.
- 405 billion parameters: This is like having a fruit stand the size of a city! With this many parameters, the model can capture incredibly subtle patterns in language, understand context, and generate highly coherent and diverse text.
Why More Parameters Matter
Just like how more workers and levels in our fruit stand can help identify more types of fruit with greater accuracy, more parameters in a language model allow it to:
- Capture more patterns: It can understand more subtle nuances in language.
- Handle more contexts: It can better understand how the meaning of words changes in different situations.
- Generate more diverse outputs: It has a larger “vocabulary” and more ways to combine words.
However, just as a massive fruit stand would require more resources to run, these large models require significant computational power and data to train and operate. For example, with Llama 3.1, a good computer with a GPU should be able to run the 8B model, and a fast computer with a couple of GPUs (or a top-of-the-line GPU) can run the 70B model. Running the 405B model takes two computers with good GPUs working together. As you can see, the more parameters a model has, the more computing power you need to run it. That’s why many companies are focusing on creating small and medium-sized models that are accurate, fast, and cheap to run.
In the foundation model world, GPT-4o mini (estimated between 40 and 70B parameters) and Claude 3.5 Sonnet (estimated 70B parameters) are examples of these small but mighty models, whereas GPT-4o (estimated 1.6 trillion to 175 trillion parameters), Claude Opus (estimated 2 trillion parameters), and Grok 3 (estimated 2.7 trillion parameters) are examples of larger models. If you’re wondering why we say “estimated” here, it’s because the companies behind the large foundation models aren’t as willing to disclose what’s going on under the hood as those developing open-source models like Meta’s Llama.
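To get a feel for why hardware needs climb with parameter count, here is a rough back-of-the-envelope sketch in Python. It estimates only the memory needed to hold the weights and ignores everything else a running model needs, so treat the results as a lower bound. The 16-bit and 4-bit storage sizes are assumptions chosen for illustration, not figures from the model makers.

```python
def weights_memory_gb(num_parameters, bytes_per_parameter):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

for name, params in [("8B", 8e9), ("70B", 70e9), ("405B", 405e9)]:
    sixteen_bit = weights_memory_gb(params, 2)    # 16-bit numbers: 2 bytes each
    four_bit = weights_memory_gb(params, 0.5)     # 4-bit numbers: half a byte each
    print(f"{name}: ~{sixteen_bit:.0f} GB at 16-bit, ~{four_bit:.1f} GB at 4-bit")
```

Storing each parameter with fewer bits shrinks the memory footprint, which is one reason smaller machines can sometimes run larger models.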
The Output Layer: The Top Shelf
The top shelf of our fruit stand is the output layer, where we get our final result. In our fruit identification network, this layer tells us what type of fruit we’re looking at.
The output could be a single answer (“This is an apple”) or a set of probabilities (“70% chance it’s an apple, 30% chance it’s a pear”).
Connecting the Layers: The Workers and Their Tools
In our fruit stand analogy, the workers moving between levels and passing information represent the connections between neurons in different layers. Each worker has a set of tools:
- Weights: These are like the worker’s “importance scales.” They determine how much attention to pay to each piece of information from the previous layer.
- Activation Functions: Think of these as the worker’s decision-making process. After weighing all the information, the activation function determines what information to pass on to the next level.
Putting It All Together

Figure 2.4 is a diagram that illustrates the process of forward propagation in a neural network. When we input data (represented by the features ‘Shape,’ ‘Color,’ and ‘Texture’ of an apple), it moves through the network’s layers (depicted by interconnected nodes in the hidden layers). The data is processed and refined at each layer until it reaches the output layer, where it is identified as an ‘Apple.’
The beauty of this system lies in its adaptability. By adjusting the number of layers, the number of neurons in each layer, and the weights of the connections between neurons, we can design neural networks tailored to a variety of tasks—from identifying fruits to recognizing faces or even predicting stock prices!
Components of Neural Networks
Now that we’ve explored the overall architecture of neural networks using the diagram of the apple and the network, let’s zoom in on the individual components that make these digital brains function. Understanding these building blocks will give you a deeper appreciation for how neural networks process information and make decisions.
Neurons: The Workhorses of Neural Networks
In our diagram, each node (or neuron) acts like a specialized unit that processes information, much like the workers who inspect fruits in our analogy. These neurons are the fundamental units of computation in a neural network.
But what exactly does a neuron do? Let’s break it down:
- Receiving Input: Each neuron in the input layer receives data, such as the shape, color, and texture of the apple.
- Processing: The neuron then performs calculations on this input, similar to how our fruit inspectors evaluate the fruit’s characteristics.
- Producing Output: Based on its calculations, the neuron generates an output, which it sends to the next layer of neurons.
Think of each neuron as a tiny calculator specialized in one task: processing input data and producing an output.
Weights: The Importance Scales
In the neural network diagram, each connection between neurons is assigned a weight. These weights determine the significance of the connection, similar to how our imaginary fruit stand workers prioritize different attributes of fruits.
- A higher weight means the connection is more important.
- A lower weight means the connection is less important.
- A weight can even be negative, which means the input counts against the neuron’s output rather than toward it.
For instance, when identifying an apple, connections related to “round shape” might have a high weight, while those for “yellow color” might have a lower or negative weight. As the network learns, it adjusts these weights to improve accuracy.
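A minimal sketch of that idea in Python follows. The two features, their scores, the weights, and the bias are all hypothetical values picked by hand. A real network would learn its weights from data.

```python
def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias (before any activation)."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Hypothetical features, each scored from 0 to 1: [roundness, yellowness]
apple_like = [1.0, 0.0]
banana_like = [0.0, 1.0]

weights = [2.0, -1.5]   # round shape counts toward "apple", yellow counts against
bias = -0.5

print(neuron_output(apple_like, weights, bias))    # 1.5
print(neuron_output(banana_like, weights, bias))   # -2.0
```

The round fruit produces a positive score and the yellow fruit a negative one, which is exactly the effect of the high and negative weights.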
Activation Functions: The Decision-Makers
After a neuron processes inputs and applies weights, it needs to decide what output to produce. Activation functions handle this decision-making process.
Let’s look at a few common types of activation functions:
- Step Function: This is the simplest type. It’s like a light switch – either on or off. If the input is above a certain threshold, the neuron “fires” (outputs a 1); otherwise, it doesn’t (outputs a 0).
- Sigmoid Function: This function outputs values between 0 and 1. It’s great for problems where we need to predict probability.
- ReLU: This function outputs the input if it’s positive, and 0 if it’s negative. It’s currently very popular because it helps networks learn faster.
Think of activation functions as adding a layer of decision-making to our fruit stand workers. Instead of just passing along raw numbers, they’re making judgments about the information they’re processing.
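The three functions described above are short enough to write out in full. This is a plain Python sketch. The threshold of 0 for the step function is an assumed default.

```python
import math

def step(x, threshold=0.0):
    """Fire (1) if the input is above the threshold, otherwise stay off (0)."""
    return 1 if x > threshold else 0

def sigmoid(x):
    """Squash any number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Pass positive inputs through unchanged and replace negatives with 0."""
    return max(0.0, x)

for x in [-2.0, 0.0, 2.0]:
    print(x, step(x), round(sigmoid(x), 3), relu(x))
```

Running it shows the different personalities: the step function jumps from 0 to 1, sigmoid moves smoothly between them, and ReLU ignores negatives entirely.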
Putting It All Together
So, how do all these components work together? Let’s walk through it:
- Input Layer: Receives raw data, such as the features of the apple.
- Hidden Layers: Process this data through a series of weighted connections and activation functions.
- Output Layer: Produces the final decision, like identifying the object as an apple.
This process happens simultaneously across all neurons in each layer, allowing the network to process complex information quickly.
Understanding these components—neurons, weights, and activation functions—is crucial to grasping how neural networks learn and make decisions.
How Neural Networks Process Information
Now that we understand the architecture and components of neural networks, let’s explore how these digital brains actually process information. This journey from input to output helps us understand how neural networks can perform complex tasks.
Step 1: Preparing the Input
The first step in utilizing a neural network is to prepare the input data, which in our example involves features like shape, color, and texture of an apple. This is akin to setting the stage for the network to begin its work.
Data Representation
In our diagram, the apple’s attributes are represented numerically as inputs to the network. This numerical encoding process is essential for neural networks to process data. For example, if we were building a neural network to recognize handwritten digits:
- Each pixel in the image might be represented by a number ranging from 0 (black) to 255 (white).
- A 28×28 pixel image would result in 784 input values (28 * 28 = 784).
- These values are then fed into the input neurons, with each neuron corresponding to one input value.
Normalization
Normalization involves scaling these input values to a standard range, typically between 0 and 1. This step ensures uniformity, much like ensuring all fruit at our fruit stand is similarly sized for fair comparison. Normalization helps the neural network process different types of data more effectively and can accelerate the learning process.
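As a small illustration, the Python sketch below flattens a 28×28 image into 784 input values and scales each one into the 0 to 1 range by dividing by 255. The image itself is made up: all black with a single white pixel.

```python
def normalize_pixels(image):
    """Flatten a 2D grid of 0-255 pixel values into a list scaled to 0-1."""
    return [pixel / 255 for row in image for pixel in row]

# A made-up 28x28 image: black everywhere, with one white pixel at the top left
image = [[0] * 28 for _ in range(28)]
image[0][0] = 255

inputs = normalize_pixels(image)
print(len(inputs))            # 784
print(inputs[0], inputs[1])   # 1.0 0.0
```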
Step 2: Forward Propagation
With the input data prepared, the neural network processes this information through a mechanism known as forward propagation. This process involves the data moving forward through the network—from the input layer, through the hidden layers, to the output layer.
Input Layer
Each neuron in the input layer receives a distinct piece of the input data. In our apple example, this could correspond to specific attributes like color or texture.
Hidden Layers
The real processing power of the network lies in the hidden layers. Each neuron in a hidden layer:
- Receives inputs from all neurons in the previous layer.
- Multiplies each input by its corresponding weight, a value that signifies the importance of the input.
- Sums these weighted inputs to compute a total input.
- Applies an activation function to this sum to produce an output value.
- This output is then passed to neurons in the next layer.
This process occurs across all hidden layers, with each layer extracting and combining features from the previous one, allowing the network to understand more complex attributes.
Output Layer
The final outputs from the last hidden layer are passed to the output layer. The neurons here use the same method as those in the hidden layers—calculating weighted sums and applying activation functions—to produce the final results. This is analogous to our fruit stand workers analyzing and compiling information at each level before arriving at a final conclusion.
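The whole forward pass can be sketched in a few lines of Python. Every number below is made up for illustration, the network is tiny (three inputs, two hidden neurons, two outputs), and sigmoid is assumed as the activation function in both layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Compute one layer: each neuron sums its weighted inputs, adds its bias,
    and applies the activation function."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(x * w for x, w in zip(inputs, neuron_weights)) + bias
        outputs.append(sigmoid(total))
    return outputs

# Made-up values: 3 input features (shape, color, texture)
features = [0.9, 0.8, 0.3]
hidden_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]   # one row per hidden neuron
hidden_biases = [0.0, -0.1]
output_weights = [[1.2, -0.7], [-1.0, 0.6]]             # one row per output neuron
output_biases = [0.1, -0.1]

hidden = layer_forward(features, hidden_weights, hidden_biases)
output = layer_forward(hidden, output_weights, output_biases)
print(hidden)
print(output)
```

Notice that the output layer reuses the same function as the hidden layer. Only the weights and biases differ.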
Step 3: Generating the Output
The activation of neurons in the output layer yields the network’s final prediction or classification.
Classification Tasks
For classification tasks, such as identifying the type of fruit, each neuron in the output layer represents a different class. The neuron with the highest activation shows the network’s prediction. For example, if the neuron corresponding to “Apple” has the highest activation, the network identifies the input as an apple.
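Here is a minimal Python sketch of picking the winning class. It also converts the raw activations into probabilities using softmax, a common choice that this chapter doesn’t name, so treat it as an assumption. The class names and activation values are made up.

```python
import math

def softmax(activations):
    """Turn raw output activations into probabilities that sum to 1."""
    largest = max(activations)
    exps = [math.exp(a - largest) for a in activations]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Apple", "Pear", "Banana"]
activations = [2.0, 1.0, 0.1]   # made-up output-layer values

probabilities = softmax(activations)
prediction = classes[probabilities.index(max(probabilities))]
print(prediction)   # Apple
print([round(p, 2) for p in probabilities])
```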
Regression Tasks
For regression tasks, where the goal is to predict a continuous value, the output might be a single number representing the predicted value. For example, rather than simply identifying if an image shows an apple, a regression task might predict the weight of the apple in grams, the percentage of ripeness, or how many days until the apple will be perfectly ripe based on its color and appearance.
This forward propagation process is powerful because it allows the network to approximate complex functions. As the data moves through the network when identifying an apple:
- Early layers might identify basic features, such as edges, colors, or the fruit’s round shape.
- Middle layers might recognize more complex patterns, like the characteristic indentation at the top or the smooth skin texture.
- Later layers synthesize these patterns to determine that this fruit is indeed an apple rather than a similar round fruit like a peach or a tomato.
This hierarchical processing enables neural networks to perform tasks that are challenging to program using traditional rule-based methods.
Deep Learning

It’s probably impossible for you to remember the moment you realized there was a difference between a house cat and a lion. Sure, they’re both felines, but they’re clearly very different creatures. Teaching children to differentiate between animals is a task many parents take on during their children’s formative years. At first, children might only know basic shapes and colors, but with each new animal they learn about, they begin to understand more complex patterns, like distinguishing a tiger from a house cat. This is similar to how deep learning works. Deep learning focuses on training ANNs with multiple layers to learn hierarchical representations of data, much like how a child learns through experience.
Deep learning has revolutionized the field of ML by enabling the development of robust models that can learn from vast amounts of data and achieve state-of-the-art performance on a wide range of tasks. Some of the most popular deep learning architectures include convolutional neural networks, recurrent neural networks, generative adversarial networks, and autoencoders.
Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are a type of deep learning architecture specifically designed for processing grid-like data, such as images. CNNs have achieved remarkable success in computer vision tasks, including image classification, object detection, and semantic segmentation.
The main principle behind CNNs is to apply a number of convolutional filters—also referred to as kernels—to the input image before performing pooling operations and nonlinear activations. The convolutional layers learn to detect local patterns and features in the image, such as edges, textures, and shapes. As the network goes deeper, the learned features become more abstract and globally relevant.
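The Python sketch below walks a tiny made-up image through those three steps: a filter, a nonlinear activation (ReLU), and pooling. The kernel is written by hand to respond to dark-to-bright vertical edges, whereas a real CNN learns its kernels during training. Like most deep learning libraries, the sketch slides the kernel over the image without flipping it.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and record the
    weighted sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu_map(feature_map):
    """Replace negative responses with 0."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]

# A made-up 5x5 image: dark (0) on the left, bright (1) on the right
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# A hand-written kernel that responds to dark-to-bright vertical edges
kernel = [[-1, 1],
          [-1, 1]]

features = relu_map(convolve2d(image, kernel))
pooled = max_pool_2x2(features)
print(features)   # each row is [0, 2, 0, 0]: the edge lights up
print(pooled)     # [[2, 0], [2, 0]]
```

The filter responds only where the image changes from dark to bright, and pooling shrinks the 4×4 feature map to 2×2 while keeping that strong response.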
Imagine you’re an artist trying to master painting landscapes. To do this, you start by focusing on small, simple details like the shapes of leaves and the textures of rocks. As you get better, you begin to understand how to combine these elements into larger parts of the scene, like trees and mountains. Over time, you learn to view the entire landscape as a cohesive whole, knowing how each part contributes to the overall image. This process of learning from the small details to the entire scene is similar to how CNNs process images in the world of AI.
CNNs are like smart artists designed specifically for understanding and analyzing images.
- Layered Approach: Just as you might start a painting by sketching out basic shapes and then adding layers of detail, a CNN processes an image in layers. The first layers might only see simple edges or textures, while deeper layers start recognizing more complex parts like eyes or wheels.
- Convolutional Filters: Think of these as your set of paintbrushes, each one designed to highlight different features of the landscape—some brushes pick out the edges of objects, others focus on textures, and some might be perfect for capturing the play of light and shadow.
- Pooling Operations: This is akin to stepping back to look at your painting from a distance, deciding which details are important and which can be blurred out a bit. Pooling helps a CNN reduce the amount of information it needs to handle by simplifying the details while keeping the important parts.
- Nonlinear Activations: These are like decisions in the artistic process that determine how to mix colors or adjust contrasts to make certain features stand out more than others.
CNNs have several advantages:
- Translation Invariance: Just as a skilled artist can recognize a tree whether it’s in the foreground or background of a painting, CNNs can recognize objects no matter where they appear in the image.
- Spatial Hierarchy: Like building up a painting from broad washes to detailed highlights, CNNs understand images from simple shapes to complex structures.
- Parameter Efficiency: CNNs are like a set of versatile brushes that can create a variety of strokes, reducing the need for countless specific tools and making the learning process more efficient.
Popular architectures like LeNet and AlexNet are like well-known painting styles or techniques that have been developed and refined to handle specific kinds of visual tasks, from recognizing faces to navigating roads. By learning to process images in this structured, layered way, CNNs become capable of performing complex visual tasks with a level of detail and accuracy that mimics human sight.
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a type of deep learning architecture designed for processing sequential data, such as time series or natural language. They can maintain an internal state or memory, allowing them to remember previous information in a sequence, like tracking how an apple changes color as it ripens over a series of images taken across several days.
The key idea behind RNNs is to process the input sequence element by element, updating the internal state at each time step based on the current input and the previous state. This recurrent structure enables RNNs to learn and remember information from the past, making them suitable for tasks that require understanding and generating sequences.
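A stripped-down Python sketch of that update rule follows. It assumes a simple RNN with a single hidden unit, a tanh activation, and hand-picked weights. Real RNNs use many units and learn their weights from data.

```python
import math

def rnn_step(x, h_prev, w_input, w_hidden, bias):
    """One time step: mix the current input with the previous state,
    then squash the result with tanh."""
    return math.tanh(w_input * x + w_hidden * h_prev + bias)

def run_rnn(sequence, w_input=0.5, w_hidden=0.8, bias=0.0):
    h = 0.0   # the state starts empty
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_input, w_hidden, bias)
        states.append(h)
    return states

# Same values in a different order: the final state differs because order matters
print(run_rnn([1.0, 0.0, 0.0])[-1])
print(run_rnn([0.0, 0.0, 1.0])[-1])
```

In the first sequence, the network still carries a fading trace of the early input at the last step. That lingering trace is the memory.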
Imagine you’re at a party, and you’re telling a story to a group of friends. As you go through the story, each sentence builds on the last, and your friends remember the details you’ve shared so far to fully understand the context as the story unfolds. This ability to remember past details and use them in ongoing conversations is similar to how RNNs process sequences of data.
RNNs handle tasks where the order and context of your data matter, like listening to music, predicting stock trends, or, notably, working with languages.
- Handling Sequences: Just as you would not forget the beginning of your story as you reach the climax, an RNN processes data (like words in a sentence or frames in a video) one piece at a time, while remembering what came before. This memory helps the RNN understand the entire sequence in context.
- Updating Internal State: Think of each sentence in your story as updating what your friends remember about the story’s plot. Similarly, as an RNN processes each piece of data, it updates its internal state based on what it has just seen and what it remembers from before. This is how it keeps track of the sequence so far.
RNNs come in a few key flavors:
- Simple RNN: This is like telling a story using only your short-term memory for quick, simple tales.
- Long Short-Term Memory: This version is like having a notebook with you. You jot down important points from your story that you’ll need to remember to maintain the flow and context, helping you manage longer, more complex tales.
- Gated Recurrent Unit: This is like a simplified notebook where you only write supercritical updates. It is easier and quicker to use than the full notebook and nearly as effective.
RNNs have proven helpful in tasks like writing and completing sentences, translating languages, or even generating speech that sounds natural. This is because, like a good storyteller, they can remember what happened before and use that information to make what comes next make sense. By maintaining this flow and remembering the sequence, RNNs can generate outputs that are coherent and contextually relevant, just like a well-told story at a party.
Generative Adversarial Networks
The concept of generative adversarial networks (GANs) was first proposed by Ian Goodfellow and his colleagues in 2014 and has quickly emerged as one of the most exciting and significant concepts in GenAI.[6]
GANs are a class of deep learning models that consist of two neural networks, a generator and a discriminator, trained in a competitive setting.
Imagine GANs as a master forger working with an art critic, both aiming to perfect the forger’s skill in creating convincing fake paintings. The forger, called the “generator,” tries to produce paintings so perfect that they could pass as originals. The critic, known as the “discriminator,” evaluates each piece to determine whether it’s a genuine masterpiece or a forgery.
The generator network takes random noise as input and learns to create increasingly convincing paintings. Meanwhile, the discriminator network carefully examines both authentic artworks and the generator’s forgeries, learning to tell them apart by noticing subtle details like mismatches in color or inappropriate artistic strokes.
These networks train in opposition to each other: the forger constantly improves by learning from the critic’s feedback, while the critic becomes increasingly skilled at detecting even the most subtle signs of forgery. The generator tries to maximize the probability of the discriminator mistaking its forged paintings for real ones, while the discriminator works to identify the fakes accurately.
During training, both continue to improve their skills: the forger creates paintings that resemble the originals with proper coloration, texture, and shape details, while the critic develops a more discerning eye for the telltale signs of forgery.
The result is a generator network (our master forger) that eventually produces forgeries so convincing that they could fool even experts – creating high-quality artificial data that closely resembles the original paintings with no need to collect thousands of additional samples of the original artist’s work.
Here’s how the process works:
- The Generator’s Task: This network starts with random scribbles (random noise) and tries to turn them into artworks. Its goal is to make these creations so lifelike that the discriminator cannot tell they’re fake.
- The Discriminator’s Task: This network reviews each artwork to decide if it’s real or created by the generator. It’s like an art expert who scrutinizes paintings to catch any signs that they might not be authentic.
- Training as a Game: Both the generator and the discriminator are trained together in what’s similar to a game. The generator tries to outsmart the discriminator by improving its fakes, while the discriminator gets better at spotting those fakes. It’s a continuous loop of action and reaction, with each side pushing the other to improve.
- Outcome: If everything goes well, the generator becomes a master at creating realistic, diverse artworks, and the discriminator becomes a top-notch critic. The result is that the generator can produce original pieces of art that look as if they were real.
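The tug-of-war in this game can be expressed as two scores, one for each player. The Python sketch below is a simplified illustration, not a full GAN: it takes the critic’s verdicts as given (made-up probabilities that a painting is real) and computes each side’s loss using log-based formulas commonly used to train GANs. The generator formula shown is one popular variant, so treat the exact form as an assumption.

```python
import math

def discriminator_loss(score_real, score_fake):
    """Low when the critic scores real art near 1 and forgeries near 0."""
    return -(math.log(score_real) + math.log(1.0 - score_fake))

def generator_loss(score_fake):
    """Low when the critic scores the forgery near 1, meaning it was fooled."""
    return -math.log(score_fake)

# Scores are the critic's estimated probability that a painting is real.
# Early in training: the critic easily spots the forgery.
print(discriminator_loss(0.9, 0.1), generator_loss(0.1))
# Later: the forgery is convincing enough that the critic is unsure.
print(discriminator_loss(0.6, 0.5), generator_loss(0.5))
```

As the forgeries improve, the generator’s loss falls and the discriminator’s loss rises, which is the push and pull described above.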
Applications of GANs include:
- Image Synthesis: Just like creating paintings, GANs can generate new images of faces, landscapes, or even artworks that look strikingly real.
- Data Augmentation: They can create new data for training other ML models, like generating new images to train a computer vision system.
- Style Transfer: GANs can apply the style of one image (like Van Gogh’s brush strokes) to the content of another, effectively creating a new piece of art.
- Anomaly Detection: They can help spot rare or unusual data points by comparing what’s typical (real) with what’s generated.
Despite their impressive capabilities, training GANs can be tricky. The training process often suffers from instability issues where the generator and discriminator can get stuck in a cycle where neither improves. Additionally, GANs can experience “mode collapse,” where the generator produces only a limited variety of outputs (like creating one part of a painting over and over again while forgetting the rest of the original masterpiece). Balancing the learning rates and power between the forger and critic requires careful tuning to achieve high-quality, diverse results.
Autoencoders
Autoencoders are a type of deep learning model used for unsupervised learning and representation learning. Their goal is to learn a compact and meaningful representation (encoding) of the input data. Imagine an autoencoder as a highly skilled artist who specializes in both drawing apples and creating simplified sketches of them. This artist works in a unique two-step process:
- The Encoder (The Sketcher): First, the artist looks at a detailed photograph of an apple and creates a simple sketch capturing only the most essential features – perhaps just the basic outline, stem position, and a few key shadows. This simplified sketch represents the “latent space” (a compressed version that contains much less information than the original photograph but still captures the apple’s essence).
- The Decoder (The Renderer): Then, working only from this minimal sketch, the artist attempts to recreate the original detailed apple photograph, adding back all the texture, color variations, and fine details. The better the artist, the more closely this recreation will match the original.
The training aim is to minimize the difference between the original apple image and the recreated one. Through practice, our artist learns which apple features are truly important to include in the sketch (the stem’s position, the overall shape, etc.) and which details can be compressed (exact texture patterns, perfect shadows, etc.) while still allowing for accurate reconstruction.
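To make that training aim concrete, here is a deliberately tiny autoencoder in plain Python. It is a sketch under strong simplifying assumptions: the “images” are made-up 2D points that all lie on one line, the encoder squeezes each point into a single number, there are no activation functions, and the weights are adjusted with basic gradient descent. The starting weights and learning rate are hand-picked.

```python
# Made-up data: 2D points that all lie on the line y = 2x
data = [(0.5, 1.0), (-0.5, -1.0), (1.0, 2.0), (-1.0, -2.0), (0.25, 0.5)]

# Encoder weights (a, b) squeeze a point into one number z.
# Decoder weights (c, d) expand z back into a point.
a, b, c, d = 0.1, 0.1, 0.1, 0.1
learning_rate = 0.05

def reconstruct(point):
    x, y = point
    z = a * x + b * y      # encode: the one-number "sketch"
    return c * z, d * z    # decode: the recreated point

def reconstruction_loss():
    """Average squared difference between each point and its recreation."""
    total = 0.0
    for x, y in data:
        x_hat, y_hat = reconstruct((x, y))
        total += (x_hat - x) ** 2 + (y_hat - y) ** 2
    return total / len(data)

initial_loss = reconstruction_loss()
for _ in range(2000):
    grad_a = grad_b = grad_c = grad_d = 0.0
    for x, y in data:
        z = a * x + b * y
        err_x, err_y = c * z - x, d * z - y
        grad_z = 2 * (err_x * c + err_y * d)   # how the error changes with z
        grad_a += grad_z * x / len(data)
        grad_b += grad_z * y / len(data)
        grad_c += 2 * err_x * z / len(data)
        grad_d += 2 * err_y * z / len(data)
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d
final_loss = reconstruction_loss()
print(initial_loss, final_loss)
```

Because every point sits on the same line, one number is enough to describe it, so the reconstruction error shrinks as training proceeds. Real images need far more than one number, but the principle is the same.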
This process proves valuable in several ways:
- Dimensionality Reduction: Just as the sketch uses fewer lines than the detailed photograph, autoencoders compress high-dimensional data (like a 1024×1024 pixel apple image) into a much smaller representation (perhaps just 64 numbers that encode the essential features).
- Feature Learning: The autoencoder learns which aspects of apples are most important for recognition – perhaps discovering that the indentation at the top is more distinctive than the exact color gradient.
- Denoising: If given photographs of apples with blemishes or shadows, a well-trained autoencoder can learn to produce clean, idealized apple images – like an artist who knows to omit the slight bruises when recreating the apple.
- Anomaly Detection: If shown an orange instead of an apple, our artist would struggle to recreate it accurately using techniques optimized for apples. Similarly, autoencoders perform poorly when trying to reconstruct data that differs significantly from their training examples, making them useful for detecting anomalies.
Variants like variational autoencoders add controlled randomness to this process, like an artist who doesn’t just copy exactly what they see but introduces creative variations while still maintaining the “appleness” of their subject.
Overall, autoencoders help AI systems identify and focus on what’s truly important in complex data, just as our artist learns to distinguish the essential characteristics that make an apple recognizable from the incidental details that vary from one apple to another.
Artificial Neural Networks (ANNs)

Artificial Neural Networks (ANNs) are a fundamental concept in ML and AI, inspired by the structure and function of the human brain. ANNs are composed of interconnected nodes, called artificial neurons, that work together to process and learn from data.
Imagine a vast network of interconnected nodes, where each node represents a neuron in the brain. Just as neurons in the brain receive, process, and transmit information, artificial neurons in an ANN perform similar functions. They take input data, perform computations, and pass the results to other neurons in the network.
The power of ANNs lies in their ability to learn and adapt based on the data they are exposed to. By adjusting the strength of the connections between neurons (known as weights) and the thresholds at which neurons activate, ANNs can learn to recognize patterns, make predictions, and solve complex problems.
Feedforward Neural Networks
Feedforward Neural Networks are the simplest type of ANNs, where information flows in one direction from the input layer to the output layer. They comprise multiple layers of neurons, including an input layer, one or more hidden layers, and an output layer.
Think of a feedforward neural network as a one-way street that goes through a town. The data enters this street at one end (the input layer) and travels through several checkpoints or stations (hidden layers) where different workers (neurons) perform specific tasks on it, like checking IDs, sorting packages, or applying stickers. Each worker adds something to the data or transforms it. Finally, the data exits the town at the other end (output layer), where the final decision or output is given, like the destination of a package.
Each neuron in a feedforward neural network receives input from the neurons in the previous layer, applies a mathematical function to those weighted inputs, and passes the result to the neurons in the next layer. This process continues until the output layer is reached.
Backpropagation
Backpropagation is the key algorithm used to train feedforward neural networks. It is a supervised learning technique that adjusts the weights of the connections between neurons based on the error between the predicted output and the actual output.
Have you ever tried to teach a dog a new command or trick? If the dog fails to perform the new command or trick correctly, you don’t start over from scratch; instead, you go back a few steps and correct the specific action that went wrong. Similarly, backpropagation in neural networks involves going backward through the network (from output to input), tweaking and adjusting things slightly, to reduce the errors in predictions. It’s like fine-tuning your instructions based on the dog’s performance to improve its skills gradually.
During training, the network makes predictions based on the input data. The predicted output is compared to the actual output, and the error is calculated. The error is then propagated backward through the network, adjusting the weights of the connections to minimize the error. This process is repeated iteratively until the network learns to make accurate predictions.
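The loop described above can be sketched with the smallest possible network: one hidden neuron feeding one output neuron. This Python example is an illustration under simplifying assumptions: it trains on a single made-up example, leaves out activation functions and biases to keep the arithmetic visible, and uses a hand-picked learning rate.

```python
def forward(x, w1, w2):
    hidden = w1 * x            # hidden "layer" with a single neuron
    prediction = w2 * hidden   # output neuron
    return hidden, prediction

x, target = 1.0, 2.0           # one made-up training example
w1, w2 = 0.5, 0.5              # starting weights
learning_rate = 0.1
losses = []

for _ in range(50):
    # Forward pass: make a prediction and measure the error
    hidden, prediction = forward(x, w1, w2)
    losses.append((prediction - target) ** 2)

    # Backward pass: apply the chain rule from the output back toward the input
    grad_prediction = 2 * (prediction - target)
    grad_w2 = grad_prediction * hidden
    grad_hidden = grad_prediction * w2
    grad_w1 = grad_hidden * x

    # Nudge each weight a little in the direction that lowers the error
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

print(losses[0], losses[-1])
```

The error starts at 3.0625 and shrinks toward zero as the two weights settle on values whose product is close to 2. Notice that the gradient for the first weight is computed from the gradient of the layer after it, which is the “backward” in backpropagation.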
Activation Functions
Activation functions are like gatekeepers in a neural network, deciding whether a neuron should be activated based on the input it receives. This decision-making introduces non-linearity into the network, which helps it learn and model complex patterns. Here’s a look at some common activation functions:
- Sigmoid: Think of this function as a bouncer at a club who lets everyone in but gives them a specific value between 0 and 1 based on their VIP status. It’s often used in binary classification problems because it transforms the input into a range that can represent probabilities.
- Rectified Linear Unit (ReLU): This bouncer only lets people in if they have a positive VIP pass. If the input is positive, it goes through unchanged; if it’s zero or negative, it gets blocked. ReLU is popular in deep learning because of its simplicity and effectiveness in speeding up training.
- Tanh (Hyperbolic Tangent): Imagine a bouncer who considers both positive and negative inputs and assigns a value between -1 and 1. This function is similar to Sigmoid but provides a wider range of output values, making it useful for more nuanced decisions.
By introducing these gatekeepers, activation functions help neural networks learn from data more flexibly and powerfully. They play a crucial role in enabling networks to understand and model the complex relationships within data, much like how different bouncers enforce various entry rules to maintain the right atmosphere at a club.
Loss Functions
Loss functions measure the difference between the network’s predicted output and the actual (correct) output. They quantify the network’s performance and guide the learning process by providing a feedback signal for optimization. Loss functions in neural networks can be likened to a golf game. The objective is to hit the ball into the hole with as few strokes as possible. Each stroke represents an attempt to predict the output, and the distance from the ball to the hole after each stroke is analogous to the “loss,” or the error between the predicted and actual outcomes. Your goal is to tweak your playing strategy (model parameters) to shrink that distance (loss), aiming for a hole-in-one (perfect prediction).
- Mean Squared Error: Calculates the average squared difference between the predicted and actual outputs. It is commonly used in regression problems.
- Cross-Entropy: Measures the dissimilarity between the predicted probability distribution and the actual probability distribution. It is often used in classification problems.
Imagine each hole on a golf course is a different problem you want your model to solve. When you hit the ball, where it lands is your model’s prediction. The hole itself is the true, correct answer or outcome. Mean Squared Error is like measuring the distance from where your ball landed to the hole, squaring that distance (which makes sure we consider larger errors more severely), then averaging these squared distances over the course of a game. This gives you a clear idea of how well you are playing overall—the smaller the score, the better your accuracy.
Now, consider a different scenario in golf where each hole has multiple sections or pockets, each representing a different category or class. Your goal is not only to get the ball into the correct hole but specifically into the right section of that hole. Cross-entropy measures how good you are at not only hitting the ball into the right hole but also getting it into the correct section. If you aim for one section (a predicted class) but end up in another (actual class), cross-entropy calculates the discrepancy between your intended target and where you landed. This is especially useful in tasks like classification, where you need to be precise, not just about getting it right, but how right or wrong you are.
The goal of training an ANN is to minimize the loss function; a shrinking loss indicates that the network’s predictions are becoming more accurate.
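Both loss functions can be sketched in a few lines of Python. This is a simplified illustration, not a library implementation: the function names and the small `eps` safeguard are our own assumptions, and the cross-entropy version handles a single example whose correct class is known.

```python
import math

def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and targets."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def cross_entropy(predicted_probs, true_class, eps=1e-12):
    """Negative log of the probability the model gave to the correct class.

    eps is an assumed safeguard that avoids taking log(0).
    """
    return -math.log(max(predicted_probs[true_class], eps))

if __name__ == "__main__":
    # Regression: three predictions compared with three true values.
    print(mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))

    # Classification: the model gives class 0 a probability of 0.7.
    probs = [0.7, 0.2, 0.1]
    print(cross_entropy(probs, true_class=0))  # smaller loss: confident and right
    print(cross_entropy(probs, true_class=2))  # larger loss: the true class got only 0.1
```

The second cross-entropy call shows the “how right or wrong” idea: the same prediction is penalized far more when the true class is the one the model considered least likely.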
Optimization Algorithms
Optimization algorithms are like navigation tools for a neural network, guiding it to adjust the weights of connections between neurons to minimize errors and improve performance. They determine how the network learns and adapts, making sure it finds the best path to its goal.
Think of optimization algorithms as different strategies for reaching a destination, like getting to the top of a mountain. Depending on the terrain, your fitness level, and the time you have, you might choose different routes. Similarly, optimization algorithms offer various methods for updating the weights in a neural network to achieve the optimal solution. Let’s look at three common optimization algorithms.
- Gradient Descent: This is like recalculating your route at every major intersection, always choosing the path that descends most steeply toward the lowest point (the minimum of the loss function). It iteratively adjusts the weights in the direction of steepest descent, using the full training dataset to compute each step.
- Stochastic Gradient Descent (SGD): Imagine making route decisions one turn at a time, without considering the overall traffic on the entire route. This method updates the weights based on a single training example at a time, making it computationally efficient for large datasets.
- Adam (Adaptive Moment Estimation): Think of this as a sophisticated GPS that learns from past traffic patterns and adapts to current conditions more effectively. Adam adapts the learning rate for each weight based on the historical gradients, combining the benefits of momentum and adaptive learning rates.
Optimization algorithms help neural networks find the best solutions by efficiently navigating the complex landscape of data inputs and errors. They ensure the network converges to a solution that minimizes the loss function, improving the network’s overall performance.
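To see how the three strategies differ, the sketch below fits a one-weight model, y = w * x, to a tiny made-up dataset whose ideal weight is 2.0. Everything here is an assumption chosen for illustration: the data, the learning rates, the step counts, and the function names. The Adam hyperparameters (`beta1`, `beta2`, `eps`) follow commonly used defaults.

```python
import random

# Toy data generated from y = 2 * x, so the ideal weight is 2.0.
DATA = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

def grad(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def gradient_descent(w=0.0, lr=0.01, steps=200):
    for _ in range(steps):
        w -= lr * grad(w, DATA)               # every step looks at the whole dataset
    return w

def sgd(w=0.0, lr=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        for example in rng.sample(DATA, len(DATA)):  # visit examples in random order
            w -= lr * grad(w, [example])      # every step looks at one example
    return w

def adam(w=0.0, lr=0.1, steps=500, beta1=0.9, beta2=0.999, eps=1e-8):
    m = v = 0.0                               # running averages of gradient and squared gradient
    for t in range(1, steps + 1):
        g = grad(w, DATA)
        m = beta1 * m + (1 - beta1) * g       # momentum-like term
        v = beta2 * v + (1 - beta2) * g * g   # scale term used to adapt the step size
        m_hat = m / (1 - beta1 ** t)          # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (v_hat ** 0.5 + eps)
    return w

if __name__ == "__main__":
    print("gradient descent:", gradient_descent())
    print("SGD:             ", sgd())
    print("Adam:            ", adam())
```

All three should land close to 2.0 on this easy problem. The differences matter more on large datasets and bumpy loss landscapes, where the cost of each step and the choice of step size become important.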
Sidebar – From AI to Generative AI

This image illustrates the nested relationship among key concepts in artificial intelligence. AI is the broadest category, encompassing any technique that enables computers to mimic human intelligence. Within AI is ML, which uses algorithms to detect patterns in data. Deep Learning, in turn, is a subset of ML that relies on layered neural networks to process complex inputs. Finally, Generative AI sits within Deep Learning, focusing on models that can create new content—such as text, images, or other data—by learning patterns from existing datasets.
We will explore Generative AI in more detail in Chapter 3.
Key Takeaways
- Deep learning architectures, such as CNNs, RNNs, GANs, and autoencoders, enable machines to learn hierarchical representations and perform complex tasks in various domains.
- Artificial neurons in ANNs receive, process, and transmit information, mimicking the behavior of biological neurons in the human brain.
- Activation functions, loss functions, and optimization algorithms are crucial components in training ANNs and minimizing the error between predicted and actual outputs.
- Deep learning and ANNs have vast potential for solving real-world problems and are transforming industries such as computer vision, natural language processing, and robotics.
Exercises
- Imagine you are designing a CNN for a self-driving car. What kind of layers and filters would you include to enable the car to recognize traffic signs, pedestrians, and other vehicles? Sketch out a simple CNN architecture and explain your reasoning.
- Create an analogy for the backpropagation process in ANNs. How would you explain this concept to a friend who has no background in AI or ML? Use a real-life example to illustrate the iterative process of adjusting weights to minimize the error.
- Research a real-world application of GANs, such as image synthesis, style transfer, or anomaly detection. Write a short paragraph explaining how GANs are used in this application and the benefits they provide.
- Suppose you are building an ANN for sentiment analysis of movie reviews. What type of activation function would you choose for the output layer and why? Explain how this activation function helps in classifying the sentiment as positive or negative.
- Compare and contrast two different optimization algorithms, such as Gradient Descent and Adam. What are the key differences between them, and in what scenarios might one be preferred over the other? Provide examples to support your explanation.
Natural Language Processing
Learning Objectives
- Define NLP and explain its core components.
- Describe common natural language processing techniques used to analyze and interpret text.
- Identify and discuss various applications of NLP.
- Discuss the common challenges in NLP.

Natural language processing (NLP) has garnered a lot of attention since the release of ChatGPT in November 2022, but the origins of this area of research date back to the 1950s.[7] The earliest form of NLP involved teaching a computer to translate Russian into English. Over the years, the area of NLP has evolved.
Defining NLP
For our purposes in this chapter, we define natural language processing (NLP) as:
A subfield of AI that combines computational linguistics, statistical models, and ML, enabling computers to analyze, generate, interpret, respond to, and understand both oral and written human language, facilitating seamless interaction between humans and machines.
Let’s break down the definition of NLP into its core components and explore each one in more detail. We’ll use examples and metaphors to help clarify the concepts for those who might be new to the world of AI and ML.
Subfield of AI
NLP is a specific area within the broader field of AI. AI focuses on creating intelligent machines that can perform tasks typically requiring human-like intelligence, such as understanding language, recognizing speech, or making decisions. NLP is one of the key areas of AI that deals specifically with language-related tasks.
Interdisciplinary
NLP leverages various techniques and approaches to enable computers to understand and process human language effectively. It combines three main areas: linguistics, statistics, and ML. First, NLP involves computational linguistics. This involves using rule-based methods to analyze the structure, grammar, and meaning of language. Think of it as teaching a computer the basic rules of a language, similar to how we learn grammar in school. Second, NLP involves a range of statistical models. These are mathematical models that capture patterns and relationships in language data. Imagine you’re trying to guess the next word in a sentence based on the words that come before it. Statistical models learn these patterns from large amounts of text data to make predictions. Finally, NLP uses ML. This involves training computer algorithms to learn from data and improve their performance over time. It’s like teaching a child to recognize different animals by showing them many examples. The more examples the child sees, the better they become at identifying animals correctly.
Analyze, Generate, Interpret, Respond and Understand
NLP empowers computers to perform five language-related tasks:
- Analyze: NLP allows computers to break down human language into smaller components, such as words, phrases, and sentences, to understand its structure and meaning. It’s like dissecting a sentence to understand its parts and how they fit together.
- Interpret: NLP enables computers to grasp the meaning and context of human language. It’s like being able to understand the underlying message or intent behind someone’s words.
- Generate: NLP allows computers to create human-like language, such as writing articles, composing emails, or generating responses in a conversation. It’s like having a computer write a story or an essay on its own.
- Respond: NLP enables computers to provide appropriate responses to human queries or commands. It’s like having a virtual assistant that can understand your questions and give you relevant answers.
- Understand: NLP helps computers comprehend the nuances, emotions, and intentions conveyed through human language. It’s like being able to read between the lines and understand the subtle meanings in someone’s words.
Oral and Written Human Language
NLP deals with both spoken (oral) and written forms of human language. First, NLP can process and understand speech, enabling applications like voice assistants (e.g., Siri, Alexa) or speech-to-text transcription. Second, NLP can analyze and understand text data (e.g., articles, books, emails, and social media posts).
Human-Machine Interaction

The ultimate goal of NLP is to enable smooth and natural communication between humans and computers. It aims to bridge the gap between how humans express themselves and how computers process information. By allowing computers to understand and respond to human language, NLP makes it easier for people to interact with machines without needing specialized technical knowledge.
Metaphorically, you can think of NLP as a translator that helps humans and computers understand each other. Just like a human translator facilitates communication between people who speak different languages, NLP acts as an intermediary between human language and computer language, enabling them to exchange information and work together effectively.
NLP has numerous applications in our daily lives. Some common examples include:
- Virtual assistants: NLP powers virtual assistants like Siri, Alexa, or Google Assistant, allowing them to understand and respond to our voice commands and queries.
- Language translation: NLP enables automatic translation between different languages, making it easier for people to communicate across language barriers.
- Sentiment analysis: NLP can analyze text data, such as customer reviews or social media posts, to determine the overall sentiment or emotion expressed (e.g., positive, negative, or neutral). (More on this in a moment.)
- Chatbots: NLP enables chatbots to understand and respond to user queries in a conversational manner, providing customer support or assistance.
- Text summarization: NLP can automatically generate concise summaries of long articles or documents, saving time and effort in information processing.
Major Aspects of NLP
At its core, NLP involves several key aspects that work together to enable computers to process and understand human language effectively. These aspects include language translation, text analysis, and speech recognition.
Language Translation
Imagine you’re traveling to a foreign country where you don’t speak the local language. With language translation technology, you can simply type or speak a phrase in your native language, and the computer will automatically translate it into the target language. This has opened up new possibilities for global communication and understanding.
Language translation allows people who speak different languages to communicate with each other in almost real time. These AI tools have made significant strides in recent years, thanks to advances in ML and deep learning techniques. Many people are familiar with this type of NLP because of tools like Google Translate, DeepL, or Microsoft Translator. Admittedly, language experts say that these translations often miss the nuance of actual language use, so they are not always the best. With newer LLMs, there is hope that translation quality will improve.
Translation Approaches
Language translation involves several key approaches and techniques, including machine translation, neural machine translation, encoder-decoder models, and attention mechanisms. Let’s explore each of these in more detail with examples to help us better understand them.
Machine Translation
Machine translation is the process of automatically translating text from one language to another using computer algorithms. For example, a machine translation system might translate a sentence from English to Spanish. Traditional machine translation approaches, such as rule-based and statistical machine translation, rely on explicit linguistic rules and statistical models. Imagine you have a phrase like, “Hello, how are you?” The rule-based approach would use a set of predefined rules to translate each word and its grammatical structure. In contrast, the statistical approach would analyze large amounts of bilingual text to find the most likely translation based on patterns. These methods require extensive linguistic knowledge and large parallel corpora (texts available in both the source and target languages) to train the translation models.
Neural Machine Translation
Neural machine translation (NMT) is a more recent approach that leverages deep learning techniques, particularly neural networks, to perform language translation. NMT has revolutionized the field by providing more accurate and fluent translations than traditional approaches.[8] For example, when translating “I love to play soccer” from English to French, an NMT model can capture the nuances and context better, producing “J’aime jouer au football” instead of a more literal and possibly awkward translation. NMT models learn to translate by directly mapping the source language text to the target language text without relying on explicit linguistic rules. They are trained on large parallel corpora and learn to capture the semantic and syntactic relationships between the languages.
Encoder-Decoder Models
Encoder-decoder models are a common architecture used in NMT, consisting of two main components: an encoder and a decoder. The encoder is a neural network that takes the source language text as input and converts it into a fixed-length representation, capturing the meaning and context of the input. This representation is often a dense vector or a sequence of vectors, which are sequences of numbers representing data in a way that the neural network can process.
The decoder, another neural network, takes the encoded representation from the encoder and generates the target language text. It produces the translation word by word or phrase by phrase, using the encoded representation to understand the context and meaning of the source text. In the context of neural networks, a vector is a crucial element that captures important information about the input, such as meaning and context.
For instance, when translating the sentence “The weather is nice today” into Japanese, the encoder takes the English sentence and converts it into a fixed-length representation, a vector, that captures the essence of the sentence. This vector might look something like [0.5, -0.3, 0.8, …] (with the actual numbers being determined by the neural network during training). The decoder then takes this vector and generates the Japanese translation, “今日は天気がいいです,” word by word or phrase by phrase, using the encoded representation to produce the most accurate translation. This process allows the model to handle complex translations by breaking down the task into encoding the source text into a meaningful representation and then decoding it into the target text.
Sidebar: Understanding Vectors in ML
What is a Vector?
In ML, a vector is an ordered set of numbers that can represent various types of data, such as the features of an image, the words in a sentence, or even abstract concepts like emotions or topics. Think of a vector as a list of values, each representing a specific feature or characteristic of the data. These values can capture important information, such as meaning and context, which the ML model can use to make predictions or decisions.
How Vectors Are Used in ML
Vectors play a crucial role in ML because they allow complex data to be converted into a numerical format that algorithms can understand. Here’s how they are typically used:
- Data Representation: Vectors represent different data in a consistent, numerical form. For example, a vector can represent the colors in an image, the words in a sentence, or the attributes of a product.
- Feature Extraction: In text processing, each word in a sentence can be represented as a vector. These word vectors capture the semantic meaning of words based on their usage in large datasets.
- Model Input: ML models, such as neural networks, take vectors as input. These vectors contain the essential information needed for the model to learn patterns and make predictions.
Example of a Vector
Let’s consider a simple example. Imagine we want to represent the colors in an image using vectors. Each pixel in the image can be represented as a vector of three numbers, corresponding to the red, green, and blue (RGB) color values. For instance, red might be represented as [255, 0, 0], where 255 is the intensity of the red color, and 0 indicates no green or blue color.
In NLP, a sentence like “I love to play soccer” can be converted into a sequence of word vectors. Each word (“I,” “love,” “to,” “play,” “soccer”) is represented by a vector that captures its meaning and relationship to other words. This is often done using pre-trained word embeddings like Word2Vec or GloVe, where each word is mapped to a point in a high-dimensional vector space. Let’s use Word2Vec, a technique developed by researchers at Google that creates word vectors by training a neural network on a large corpus of text, to represent each word in our phrase as a vector. Here are example vectors for each word (note that these are simplified for illustration purposes):
- I: [0.1, 0.3, 0.5, 0.7]
- love: [0.8, 0.6, 0.2, 0.1]
- to: [0.3, 0.4, 0.7, 0.2]
- play: [0.6, 0.9, 0.1, 0.4]
- soccer: [0.9, 0.2, 0.5, 0.8]
Each vector is a simplified representation of the word’s meaning in a four-dimensional space. In practice, these vectors would be much longer, capturing more nuanced relationships between words.
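Once words are vectors, a program can compare them with ordinary arithmetic. One common measure (our choice here, not the only option) is cosine similarity, which is close to 1 when two vectors point in nearly the same direction. The sketch below applies it to the simplified vectors above. Because those numbers are made up for illustration, the resulting scores show the mechanics only and say nothing about real word meanings.

```python
import math

# The simplified four-dimensional example vectors from the list above.
VECTORS = {
    "I":      [0.1, 0.3, 0.5, 0.7],
    "love":   [0.8, 0.6, 0.2, 0.1],
    "to":     [0.3, 0.4, 0.7, 0.2],
    "play":   [0.6, 0.9, 0.1, 0.4],
    "soccer": [0.9, 0.2, 0.5, 0.8],
}

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

if __name__ == "__main__":
    for other in ("play", "I"):
        score = cosine_similarity(VECTORS["love"], VECTORS[other])
        print(f"love vs {other}: {score:.3f}")
```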
Word2Vec and GloVe
Word2Vec and GloVe are two popular techniques for creating word vectors:
- Word2Vec: Developed by researchers at Google, Word2Vec is a method that creates word vectors by training a neural network on a large corpus of text. The resulting vectors capture semantic relationships between words. For example, the words “king” and “queen” would be close to each other in the vector space, as would “cat” and “dog.” Word2Vec uses two main approaches: Continuous Bag of Words (CBOW), which predicts a target word from its context words, and Skip-gram, which predicts context words from a target word.
- GloVe (Global Vectors for Word Representation): Developed by researchers at Stanford, GloVe creates word vectors by analyzing the global statistical information of a corpus. It constructs a co-occurrence matrix, which counts how often words appear together in a given context. By factorizing this matrix, GloVe produces word vectors that capture both the local context and the global statistical information of the corpus. GloVe vectors are effective at capturing analogies, such as “man is to woman as king is to queen.”
Vectors are fundamental in ML because they transform complex, varied data into a standardized numerical format that models can process. By understanding vectors and their role, we can better appreciate how ML algorithms analyze and interpret data to make intelligent decisions.
Attention Mechanisms
Attention mechanisms are a key component of modern NMT models. They allow the model to focus on different parts of the input sequence when generating each word in the output sequence.
Think of attention as a way for the model to “pay attention” to specific words or phrases in the source text that are most relevant for generating each word in the target text. By selectively focusing on different parts of the input, attention mechanisms help improve the quality and coherence of the translations.
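The idea of “paying attention” can be sketched with a little arithmetic. The example below uses scaled dot-product attention, the formulation popularized by Transformer models, for a single output word. It is a toy illustration under stated assumptions: the two-dimensional vectors are made up, and real models learn their queries, keys, and values during training.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # Score each source word by how well its key matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Blend the value vectors, weighted by how much attention each word receives.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

if __name__ == "__main__":
    source_words = ["the", "weather", "nice"]        # hypothetical source sentence
    keys = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]      # made-up key vectors
    values = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]    # made-up value vectors
    query = [1.0, 0.0]                               # made-up query for one output word
    weights, context = attend(query, keys, values)
    for word, weight in zip(source_words, weights):
        print(f"{word}: {weight:.2f}")
```

The weights always sum to 1, so you can read them as the share of attention each source word receives while the model produces one output word.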
Language Translation in Our Lives
Language translation has numerous applications, including:
- Cross-lingual communication: Enabling people who speak different languages to communicate with each other, whether it’s through written text or spoken language.
- Localization and globalization: Translating content, such as websites, documents, or software interfaces, into multiple languages to reach a wider audience.
- Multilingual information retrieval: Allowing users to search for information in one language and retrieve relevant results in another language.
- Language learning: Providing translations and language learning resources to assist in language acquisition and comprehension.
Text Analysis

Text analysis is a fundamental aspect of NLP that focuses on making sense of written language. It’s like teaching a computer to read and understand text the way humans do. By breaking down text into smaller components and examining their structure and meaning, computers can extract valuable insights and perform various language-related tasks.
Imagine you have a large collection of customer reviews for a product. As a human, you can easily read through each review, understand the opinions expressed, and identify key information like product features or customer sentiments. Text analysis aims to replicate this human-like understanding of text using computational methods.
Text Analysis Techniques
Text analysis involves several key techniques, including tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis. Let’s explore each of these techniques in more detail.
Tokenization and Context Windows
In NLP, understanding and processing text involves breaking it down into smaller units and considering the surrounding context. This section explores the concepts of tokenization and context windows.
Tokenization
Tokenization is the process of breaking down text into smaller units, such as individual words, phrases, or characters, which are called tokens. It’s like splitting a sentence into its constituent parts. This process is essential because computers cannot directly process raw text – they need structured data that can be converted into numerical representations for processing.
Approaches to Tokenization
Modern language models employ three main approaches to tokenization. The first is word-level tokenization, which splits text into complete words. For instance, the phrase “Artificial Intelligence is fascinating” would be broken into the tokens “Artificial,” “Intelligence,” “is,” and “fascinating.” While this approach is simple and intuitive, it often struggles with rare words and variations of common words.
The second approach, sub-word tokenization, has become increasingly popular in modern language models. This method breaks words into meaningful parts, allowing for more flexible processing of language. Using our previous example, “Artificial Intelligence is fascinating” might be tokenized as “Art,” “##ificial,” “Intelli,” “##gence,” “is,” “fas,” and “##cinating.” This approach is particularly effective for handling rare words and morphological variations, making it the preferred choice for many current language models including GPT-4.
The third approach, character-level tokenization, breaks text down into individual characters. This method is the most granular, breaking even simple words like “AI” into the separate tokens “A” and “I.” Because it produces far more tokens for the same text, it requires more processing power, but it can be useful for handling complex language structures and unknown words.
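The three approaches can be compared side by side in a short Python sketch. The sub-word function below uses a greedy longest-match rule in the style of WordPiece, which is where the “##” continuation marker comes from; GPT models use a different sub-word algorithm (byte-pair encoding). The tiny vocabulary is hypothetical and hand-picked to reproduce the example above, whereas real tokenizers learn vocabularies of tens of thousands of pieces from data.

```python
def word_tokenize(text):
    """Word-level: split the text on whitespace."""
    return text.split()

def char_tokenize(text):
    """Character-level: every non-space character becomes a token."""
    return [ch for ch in text if not ch.isspace()]

# Hypothetical toy vocabulary, chosen only to match the example in the text.
TOY_VOCAB = {"Art", "ificial", "Intelli", "gence", "is", "fas", "cinating"}

def subword_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split; pieces after the first get a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it appears in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:            # no known piece fits this part of the word
            return ["[UNK]"]
        piece = word[start:end]
        pieces.append(piece if start == 0 else "##" + piece)
        start = end
    return pieces

if __name__ == "__main__":
    sentence = "Artificial Intelligence is fascinating"
    print(word_tokenize(sentence))
    print([p for w in word_tokenize(sentence) for p in subword_tokenize(w)])
    print(char_tokenize("AI"))
```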
Impact on Model Performance
The choice of tokenization method significantly impacts a model’s performance and efficiency. Modern LLMs typically favor sub-word tokenization as it offers an optimal balance between vocabulary size and model effectiveness. This approach helps models generalize better across languages and handle previously unseen words more effectively.
Different language models have varying limitations on how many tokens they can process at once. For instance, the original GPT-4 model launched with a context window of 8,192 tokens for combined input and output, and later variants support larger windows. When processing text, these models handle different types of content in specific ways. Common words are typically treated as single tokens, while rare or complex words are broken into sub-words. Punctuation marks are usually treated as separate tokens, and in many sub-word tokenizers a leading space is folded into the token for the word that follows it.
Context Windows
Besides tokenization, understanding the context in which tokens appear is crucial for many NLP tasks. This is where the concept of context windows comes into play. A context window, sometimes called a sliding window, is a technique used to capture the surrounding words or tokens of a target token to provide context.
For example, consider the sentence “The quick brown fox jumps over the lazy dog.” If we take the word “fox” as our target token and use a context window size of 2, we look at two tokens to the left and two tokens to the right of “fox.” The context window would include [“quick,” “brown,” “jumps,” “over”].
By incorporating context windows, we can better understand the relationships and dependencies between words. For instance, knowing that “quick” and “brown” appear before “fox,” and “jumps” and “over” appear after, helps the model grasp the meaning of the sentence more effectively.
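The sliding-window idea takes only a few lines of Python. The function name `context_window` and its parameters are our own choices for this sketch, and the sentence is split on spaces for simplicity.

```python
def context_window(tokens, target, size):
    """Return up to `size` tokens on each side of the first occurrence of `target`."""
    i = tokens.index(target)
    left = tokens[max(0, i - size):i]       # clip at the start of the sentence
    right = tokens[i + 1:i + 1 + size]      # slicing clips at the end automatically
    return left + right

if __name__ == "__main__":
    tokens = "The quick brown fox jumps over the lazy dog".split()
    print(context_window(tokens, "fox", 2))   # ['quick', 'brown', 'jumps', 'over']
```

Near the edges of a sentence the window simply gets shorter: asking for the context of “The” with a size of 2 returns only the two words to its right.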
Modern NLP models, like Google’s Gemini, have significantly expanded the concept of context windows. The Gemini 1.5 model uses a context window size of up to 2 million tokens.[9] This means the model can consider an extensive amount of surrounding text when processing each token. So, why are large context windows important?
- Long-range dependencies: With a larger context window, models can understand long-range dependencies and relationships between tokens. This is crucial for tasks that require a comprehensive understanding, such as summarizing long documents, translating complex text, or answering detailed questions.
- Coherence: In tasks like text generation or dialogue systems, larger context windows help maintain coherence over longer texts, ensuring that the generated output stays relevant and contextually accurate throughout.
- Intricate tasks: With models dealing with intricate tasks (e.g., legal document analysis, scientific research papers, and literature reviews), a large context window allows them to consider more information, leading to more accurate and nuanced interpretations.
- Complex queries: For search engines or virtual assistants, large context windows enable the models to understand and process complex queries that reference information from earlier parts of a conversation or text, improving their ability to provide relevant and precise answers.
Tokenization and Large Context Windows Together
When used together, tokenization and large context windows enable sophisticated text processing. Tokenization splits the text into manageable pieces, while large context windows provide the extensive context needed for each token. This combination is useful in various NLP applications, such as:
- Long Document Summarization: Summarizing books or lengthy articles by considering the entire text to capture all important points.
- Detailed Question Answering: Answering complex questions by referencing a large amount of preceding text to provide accurate and comprehensive responses.
- Cross-Document Analysis: Understanding references and connections across multiple documents, useful in research and legal fields.
For example, in a long legal document, tokenization breaks the text into smaller units, and a large context window ensures the model can understand the context of each token by considering the entire document, leading to more accurate and contextually relevant outputs.
Part-of-Speech Tagging
Part-of-speech (POS) tagging involves identifying the grammatical role of each word in a sentence. It’s like labeling each word with its part of speech, such as noun, verb, adjective, or adverb. POS tagging helps computers understand the structure and meaning of a sentence.
Going back to our example sentence, POS tagging would assign labels like this: [(“The,” Determiner), (“quick,” Adjective), (“brown,” Adjective), (“fox,” Noun), (“jumps,” Verb), (“over,” Preposition), (“the,” Determiner), (“lazy,” Adjective), (“dog,” Noun)]. By identifying the parts of speech, computers can better understand the relationships between words and the overall meaning of the sentence.
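In code, the result of POS tagging is usually a list of (word, tag) pairs. The sketch below produces that structure with a simple dictionary lookup. This is a deliberately naive stand-in: the lexicon is hypothetical and covers only our example sentence, while real taggers are trained models that use the surrounding words to decide each tag.

```python
# Hypothetical lookup table that covers only the example sentence.
TOY_LEXICON = {
    "the": "Determiner", "quick": "Adjective", "brown": "Adjective",
    "fox": "Noun", "jumps": "Verb", "over": "Preposition",
    "lazy": "Adjective", "dog": "Noun",
}

def pos_tag(sentence):
    """Label each word with its part of speech, or 'Unknown' if it is not in the lexicon."""
    words = sentence.rstrip(".").split()
    return [(word, TOY_LEXICON.get(word.lower(), "Unknown")) for word in words]

if __name__ == "__main__":
    for word, tag in pos_tag("The quick brown fox jumps over the lazy dog."):
        print(f"{word}: {tag}")
```

A lookup table breaks down quickly, because many words change roles depending on context (“jumps” can be a verb or a plural noun). That ambiguity is exactly why practical taggers look at neighboring words.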
Named Entity Recognition
Named Entity Recognition (NER) is the task of identifying and classifying named entities in text. Named entities are specific types of information, such as person names, organizations, locations, dates, or quantities. NER helps computers extract meaningful information from unstructured text data.
For instance, let’s say you have a news article that mentions “Apple” and “Steve Jobs.” NER would identify “Apple” as an organization and “Steve Jobs” as a person. This information can be useful for various applications, such as information retrieval, question answering, or content categorization.
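A very simple way to approximate NER is to search the text for names from a known list, sometimes called a gazetteer. The sketch below does this for the two entities in our example. The list, the labels, and the function name are assumptions made for illustration; production NER systems are trained models that use context rather than a fixed list.

```python
# Hypothetical gazetteer: known names mapped to entity types.
GAZETTEER = {"Steve Jobs": "PERSON", "Apple": "ORGANIZATION"}

def find_entities(text):
    """Return (name, label) pairs in the order the names appear in the text."""
    found = []
    for name, label in GAZETTEER.items():
        start = text.find(name)
        while start != -1:                      # collect every occurrence of the name
            found.append((start, name, label))
            start = text.find(name, start + len(name))
    return [(name, label) for _, name, label in sorted(found)]

if __name__ == "__main__":
    print(find_entities("Apple was co-founded by Steve Jobs."))
```

A fixed list cannot tell Apple the company from an apple in a fruit bowl, and it misses any name it has never seen. Handling both cases is what makes real NER a learning problem.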
Sentiment Analysis
Sentiment analysis involves determining the sentiment or emotion expressed in a piece of text. It’s like gauging the overall positivity, negativity, or neutrality of a text document. Sentiment analysis is particularly useful for understanding customer opinions, social media trends, or brand perception. Last year, I was working on an evaluation of student perspectives on the quality of our department and specific learning targets. We asked several open-ended qualitative questions. I put those responses into ChatGPT, had it run a sentiment analysis, and got results that mirrored what a more traditional scholar found after conducting a manual analysis. Sentiment analysis can be performed at different levels, such as the document level (determining the overall sentiment of a document), the sentence level (determining the sentiment of individual sentences), or the aspect level (determining the sentiment toward specific aspects or features mentioned in the text).
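To illustrate the difference between document-level and sentence-level analysis, here is a minimal lexicon-based sketch: it counts words from small positive and negative lists. The word lists, the sample review, and the function names are all hypothetical. This is also a much cruder method than what ChatGPT or any trained sentiment model does, and it ignores negation (“not good”) and sarcasm.

```python
import re

# Hypothetical mini-lexicons; real sentiment lexicons contain thousands of entries.
POSITIVE = {"great", "love", "excellent", "helpful", "good"}
NEGATIVE = {"bad", "boring", "confusing", "poor", "hate"}

def score(text):
    """Count of positive words minus count of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def label(text):
    """Document-level label based on the overall score."""
    s = score(text)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"

def sentence_labels(document):
    """Sentence-level labels: split on end punctuation and label each piece."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", document) if s.strip()]
    return [(s, label(s)) for s in sentences]

if __name__ == "__main__":
    review = "The lectures were excellent and the labs were helpful. The textbook was confusing."
    print(label(review))
    for sentence, sentiment in sentence_labels(review):
        print(f"{sentiment}: {sentence}")
```

The document as a whole comes out positive, yet the sentence-level view surfaces a complaint about the textbook that the overall label hides.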
Text Analysis in Society
These are just a few of the key techniques involved in text analysis, but they form the foundation for more advanced NLP tasks. Understanding these techniques is crucial for anyone working with text data, as they provide valuable insights into the structure, meaning, and sentiment of written language.
Text analysis has numerous applications across various domains. Some common examples include:
- Content categorization: Categorizing text documents into predefined categories or topics.
- Opinion mining: Extracting and analyzing opinions, sentiments, or emotions expressed in text data, such as customer reviews or social media posts.
- Information extraction: Identifying and extracting specific pieces of information from unstructured text, such as names, dates, or locations.
- Text summarization: Generating concise summaries of long text documents while preserving the key information.
As NLP continues to advance, text analysis techniques are becoming more sophisticated and accurate. Researchers and practitioners are continuously exploring new approaches and algorithms to improve the performance and efficiency of text analysis tasks.
The Predictive Nature of LLMs
At their core, LLMs are sophisticated prediction engines that generate text by calculating probabilities of what words should come next in a sequence. Unlike traditional computer programs that follow strict rules, these models learn patterns from vast amounts of text data and use these patterns to make educated guesses about appropriate language use. To accomplish this complex task, they rely on an intricate system of adjustable parameters that help them process and generate human-like text.
Introduction to Neural Network Parameters
LLMs operate through complex neural networks that use parameters to make decisions about language generation. These parameters function as adjustable elements within the model, similar to knobs that can be fine-tuned to improve performance. The two primary types of parameters are weights, which determine the relative importance of different inputs, and biases, which help adjust the model’s outputs based on learned patterns from training data.
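A single artificial neuron shows how weights and biases work together. The sketch below is a simplified illustration with made-up numbers; the `neuron` function is a hypothetical name, and real models chain billions of these calculations together.

```python
def neuron(inputs, weights, bias):
    """Multiply each input by its weight, add them up, then add the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


inputs = [1.0, 2.0]      # the data coming in
weights = [0.5, -0.25]   # how much each input matters (learned during training)
bias = 0.1               # a learned offset that shifts the output

output = neuron(inputs, weights, bias)
parameter_count = len(weights) + 1  # two weights plus one bias

print(output)
print(parameter_count)
```

Training a model means nudging values like `weights` and `bias` until the outputs become useful.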
Scale and Complexity
The scale of modern LLMs is staggering. For example, GPT-3 contains 175 billion parameters distributed across 96 layers, while GPT-4, whose architecture OpenAI has not published, is unofficially estimated to have approximately 1.8 trillion parameters across 120 layers. This massive increase in parameters and layers enables these models to capture increasingly subtle nuances in language and generate more sophisticated outputs. Each parameter contributes to the model’s ability to recognize and reproduce complex language patterns.
How LLMs Generate Text

LLMs generate text through a sophisticated process of probabilistic prediction. Figure 1 illustrates this process using a simple example: the familiar phrase “Twinkle, twinkle, little…” When presented with these input words, the model processes them through multiple hidden layers of neural networks to predict the next word. Based on its training data, the model assigns different probabilities to potential next words:
- “Star” – 90% probability (highest, due to the famous nursery rhyme)
- “Light” – 5% probability (poetically plausible)
- “One” – 2% probability (grammatically possible)
- “Thing” – 2% probability (grammatically possible)
- Other words (like “cloud” or “bird”) – 1% or less
The model typically selects the word with the highest probability as its output, though this can be adjusted through temperature settings that control randomness in the generation process.
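Here is a rough sketch of how temperature reshapes the probabilities above. It assumes the common approach of raising each probability to the power 1/temperature and renormalizing (equivalent to dividing the model's raw scores by the temperature); the word "cloud" stands in for all the "other words," and the function names are invented for this example.

```python
import random

# Probabilities from the "Twinkle, twinkle, little..." example.
# "cloud" is a stand-in for every other possible word.
probs = {"star": 0.90, "light": 0.05, "one": 0.02, "thing": 0.02, "cloud": 0.01}


def apply_temperature(probs, temperature):
    """Rescale probabilities: low temperature sharpens, high temperature flattens."""
    scaled = {word: p ** (1.0 / temperature) for word, p in probs.items()}
    total = sum(scaled.values())
    return {word: value / total for word, value in scaled.items()}


def pick_most_likely(probs):
    """Always choose the single highest-probability word."""
    return max(probs, key=probs.get)


def sample(probs, rng):
    """Choose a word at random, weighted by its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


print(pick_most_likely(probs))
print(apply_temperature(probs, 0.5))  # "star" becomes even more dominant
print(apply_temperature(probs, 2.0))  # other words get a better chance
print(sample(apply_temperature(probs, 2.0), random.Random(0)))
```

At a low temperature the model almost always says "star"; at a high temperature it is more willing to surprise you.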
Sequential Word Selection
This predictive process continues word by word as the model generates text. Consider how an LLM might construct a sentence about AI: It begins by selecting “Artificial” (with a 75% probability based on the context), followed by “Intelligence” (95%) as an almost certain continuation. The model then predicts appropriate verbs like “enhances” (80%) and relevant objects like “decision-making” (85%), concluding with contextually appropriate terms like “efficiency” (70%).
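One consequence of choosing words one at a time is that the chance of producing an entire sentence is the product of the individual choices. The short sketch below multiplies the illustrative probabilities from the example above, treating each one as the probability of that word given everything before it.

```python
import math

# (word, probability of choosing it given the words so far), from the example above.
steps = [
    ("Artificial", 0.75),
    ("Intelligence", 0.95),
    ("enhances", 0.80),
    ("decision-making", 0.85),
    ("efficiency", 0.70),
]

sentence = " ".join(word for word, _ in steps)
sequence_probability = math.prod(p for _, p in steps)

print(sentence)
print(round(sequence_probability, 5))  # about 0.339, or roughly a 34% chance
```

Even when every individual choice is likely, the exact sequence is far from guaranteed, which helps explain why the same prompt can yield different responses.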
Training and Pattern Recognition
This predictive capability stems from extensive training on large text datasets. During training, the model learns to recognize patterns in how words and phrases often appear together. The model’s parameters are adjusted through this process to capture these patterns, enabling it to generate text that mirrors natural language use. When presented with new input, the model applies these learned patterns to predict the most appropriate continuations.
Impact of Context
The model’s predictions are heavily influenced by context. Each prediction considers not only the preceding word, but also the entire available context within its attention window. This enables the model to maintain coherence across longer passages and generate text that remains contextually appropriate throughout extended responses.
Limitations and Considerations
Although this predictive approach is powerful, it’s important to understand its limitations. The model can only make predictions based on patterns in its training data, and these predictions are always probabilistic rather than deterministic. This means that while the model can generate highly coherent and contextually appropriate text, it may occasionally produce outputs that are statistically likely but factually incorrect or contextually inappropriate.
Speech Recognition
Speech recognition is another crucial aspect of NLP that focuses on converting spoken language into written text. It’s like having a computer listen to what you say and transcribe it into words on a screen. Speech recognition has come a long way in recent years, enabling applications like virtual assistants, voice-controlled devices, and automated transcription services.
Imagine you’re driving in your car and need to send a text message. Instead of typing on your phone, you can speak your message aloud, and a speech recognition system will convert your words into text. This is just one example of how speech recognition is making our lives more convenient and hands-free.
Components of Speech Recognition
Speech recognition involves several key components, including acoustic modeling, language modeling, and feature extraction. Let’s explore each of these components in more detail.
Acoustic Modeling
Acoustic modeling is the process of creating statistical representations of the acoustic properties of speech. Have you ever been to a concert and been able to identify the instruments just by listening to their sounds? Acoustic modeling does something similar but with speech. It involves analyzing the sound waves of speech and identifying the distinct units of sound, called phonemes, that make up words and sentences. Think of acoustic modeling as teaching a computer to recognize the building blocks of speech. Just as we learn to recognize the sounds of different letters and words when learning a language, acoustic modeling helps computers understand the acoustic patterns associated with different speech sounds. Acoustic models are typically trained on large datasets of speech recordings, where each speech segment is labeled with its corresponding phoneme or word. By analyzing these labeled examples, the acoustic model learns to map the acoustic features of speech to their corresponding linguistic units.
Language Modeling
Language modeling is the process of building statistical models that capture the structure and patterns of language. Have you ever been in a conversation where you and the other person kept finishing each other’s sentences? We do this because we can predict what the other person might say next based on what they’ve already said. Language modeling does the same thing for computers. It involves analyzing large amounts of text data to learn the probability distributions of words and phrases in a language. When reading, we often come across a word with multiple possible meanings. As a human, you can usually determine the correct meaning based on the context of the sentence. Language modeling aims to replicate this ability in computers by providing them with knowledge about the likelihood of different word sequences. Language models are trained on vast amounts of text data, such as books, articles, or web pages. By analyzing the co-occurrence patterns of words and phrases in this data, the language model learns to predict the likelihood of a word or phrase appearing in a given context.
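The simplest language model of this kind just counts which word follows which. The sketch below builds such a "bigram" model from a single nursery rhyme; the tiny corpus and the function names are assumptions for illustration, and real systems train on vastly more text with far more sophisticated models.

```python
from collections import Counter, defaultdict

# A deliberately tiny training text.
corpus = (
    "twinkle twinkle little star how i wonder what you are "
    "twinkle twinkle little star"
)


def train_bigrams(text):
    """Count how often each word is followed by each other word."""
    words = text.split()
    counts = defaultdict(Counter)
    for previous, following in zip(words, words[1:]):
        counts[previous][following] += 1
    return counts


def next_word_probs(counts, word):
    """Convert the follow-up counts for one word into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}


model = train_bigrams(corpus)
print(next_word_probs(model, "little"))   # "star" always follows "little" here
print(next_word_probs(model, "twinkle"))  # split between "twinkle" and "little"
```

With only one rhyme to learn from, the model is certain that "little" leads to "star." More training text would give it a richer and more realistic picture.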
Feature Extraction
Feature extraction is the process of identifying and extracting relevant features from speech signals that can be used for speech recognition. It involves analyzing the acoustic properties of speech and converting them into a compact representation that captures the essential information. Just like how a detective looks for clues and evidence at a crime scene, feature extraction techniques look for specific patterns and attributes in the speech signal that can help distinguish different sounds and words. Common features used in speech recognition include spectral features (e.g., mel-frequency cepstral coefficients), temporal features (e.g., pitch, energy), and linguistic features (e.g., phonemes, syllables). These features are extracted from the speech signal using various signal processing techniques, such as Fourier transforms and filterbanks.
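As a rough illustration of turning sound into a compact set of numbers, the sketch below slices a signal into short frames and computes two simple features for each: energy (how loud the frame is) and the number of zero crossings (a crude hint at pitch). The synthetic tones, the frame size, and the zero-crossing feature are assumptions made for this example; production systems use richer features such as the mel-frequency cepstral coefficients mentioned above.

```python
import math


def make_tone(freq_hz, sample_rate, n_samples):
    """Generate a pure sine-wave tone as a list of samples."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]


def split_frames(signal, frame_size):
    """Cut the signal into back-to-back frames of equal length."""
    return [signal[i:i + frame_size]
            for i in range(0, len(signal) - frame_size + 1, frame_size)]


def energy(frame):
    """Average squared amplitude: louder frames have higher energy."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossings(frame):
    """Count sign changes: higher-pitched sounds cross zero more often."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


SAMPLE_RATE = 8000  # samples per second (an assumed value)
low_tone = make_tone(100, SAMPLE_RATE, 800)
high_tone = make_tone(1000, SAMPLE_RATE, 800)

for name, tone in (("low", low_tone), ("high", high_tone)):
    for frame in split_frames(tone, 400):
        print(name, round(energy(frame), 3), zero_crossings(frame))
```

Both tones are equally loud, so their energy matches, but the higher tone produces many more zero crossings, giving a recognizer one clue for telling the sounds apart.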
Practical Speech Recognition Applications
Speech recognition systems combine acoustic modeling, language modeling, and feature extraction to convert spoken language into written text. The acoustic model maps the speech features to phonemes or words, while the language model provides context and helps disambiguate between similar-sounding words. The feature extraction process ensures that the most relevant information is captured from the speech signal.
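A small sketch shows how the language model can break a tie the acoustic model cannot. Two phrases that sound nearly identical receive similar acoustic scores, but the language model knows which phrase people actually say. All of the scores below are invented for illustration, and multiplying the two scores is one simple way to combine them, not a description of any particular product.

```python
# How well each candidate matches the sound (invented scores; nearly tied).
acoustic_scores = {"recognize speech": 0.48, "wreck a nice beach": 0.52}

# How plausible each candidate is as everyday language (invented scores).
language_scores = {"recognize speech": 0.90, "wreck a nice beach": 0.10}


def best_transcript(acoustic, language):
    """Pick the candidate with the highest combined score."""
    return max(acoustic, key=lambda text: acoustic[text] * language[text])


print(max(acoustic_scores, key=acoustic_scores.get))      # sound alone
print(best_transcript(acoustic_scores, language_scores))  # sound plus context
```

Going by sound alone, the system would pick the wrong phrase. Adding the language model tips the decision toward the sensible one.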
Speech recognition has numerous applications, including:
- Virtual assistants: Enabling voice-controlled interfaces for devices like smartphones, smart speakers, and car infotainment systems.
- Transcription services: Automatically transcribing speech from audio or video recordings, such as meetings, lectures, or interviews.
- Voice-based authentication: Using speech as a biometric feature for user identification and authentication.
- Accessibility tools: Providing speech-to-text capabilities for individuals with hearing impairments or disabilities.
As speech recognition technology continues to advance, researchers and developers are exploring new techniques to improve its accuracy, robustness, and efficiency. Deep learning approaches, such as neural networks and sequence-to-sequence models, have significantly enhanced the performance of speech recognition systems in recent years.
Key Takeaways
- NLP is a subfield of AI that combines computational linguistics, statistical models, and ML to enable computers to understand and process human language.
- Techniques like tokenization, part-of-speech tagging, and NER are used to break down and analyze text for various applications.
- NLP is used in virtual assistants, language translation, sentiment analysis, chatbots, and text summarization, enhancing human-computer interaction.
- NLP faces challenges in dealing with the complexity and variability of human language, requiring continuous advancements to improve language understanding systems.
Exercises
- Develop a simple NLP application, such as a chatbot or sentiment analysis tool, using available NLP libraries.
- Perform text analysis on a dataset, identifying named entities and sentiment.
- Research and present on a specific application of NLP, discussing its technology and impact.
Computer Vision
Learning Objectives
- Define computer vision and its purpose in AI.
- Identify the key components of computer vision systems.
- Describe techniques used in computer vision, including image classification, object detection, and image segmentation.
- List common applications of computer vision in fields such as transportation, healthcare, and security.
- Explain challenges faced in computer vision, such as occlusion, lighting variation, and large-scale data processing.
Computer vision is the field of AI that focuses on enabling computers (and smart devices) to interpret and understand visual information from the world around us. It’s like giving machines the gift of sight, allowing them to perceive and analyze images and videos in a way that mimics human vision. Let’s look at a simple example.

Figure 2.16 compares how human vision and computer vision work. Let’s start by looking at human vision. A human sees a bowl of fruit with their eyes. The visual cortex in the human brain processes the image, and we then interpret what we are seeing. In this case, we see an image of a bowl of fruit. There’s a bowl, some apples, a banana, and some other (unidentified) fruit.
Computer vision works in much the same way. A sensing device (e.g., a webcam or cellphone camera) captures the image. A computer uses processing power to interpret what it is seeing, and then the computer provides output. In this case, we ran the image through ChatGPT and asked it to describe the image. Here is what ChatGPT said: “This image is a flat, vector-style illustration featuring a bowl full of fruit. The bowl has a striped pattern in warm tones, and it’s brimming with stylized, colorful fruits: red cherries, red and green apples, a bunch of yellow bananas, and green grapes. The artwork is characterized by clean lines, bold flat colors, and a playful yet modern aesthetic.”
Image Recognition
Image recognition is a fundamental aspect of computer vision that focuses on teaching computers to identify and understand the contents of an image. It’s like giving machines the ability to perceive and interpret visual information, similar to how humans recognize objects and scenes.
Imagine flipping through a photo album and effortlessly recognizing the people, places, and objects in each picture. You can easily distinguish between a cat and a dog, identify your favorite vacation spot, or recognize the face of a loved one. Image recognition aims to replicate this human-like understanding of images in computers.
Image recognition involves analyzing the patterns, colors, textures, and shapes within an image to extract meaningful information. It’s like solving a puzzle, where the computer looks for specific features and characteristics that help it determine what’s in the image. Image recognition encompasses several key tasks, including image classification, object localization, and segmentation. Let’s explore each of these tasks in more detail.
Image Classification
Image classification is the task of assigning a label or category to an entire image based on its content. It’s like playing a game of “What’s in this picture?” with a computer. Given an image, the computer’s goal is to determine the most appropriate label that describes the overall scene or the main object present.
For example, if you show a computer an image of a cat, the image classification task would involve the computer analyzing the image and predicting the label “cat” as the most likely category. Similarly, if you present an image of a beach sunset, the computer would classify it as “beach” or “sunset.”
Object Localization
Object localization goes a step further than image classification by not only identifying the objects present in an image but also determining their precise locations. It’s like asking the computer, “What objects are in this picture, and where exactly are they located?”
In object localization, the computer not only predicts the labels of the objects but also outputs bounding boxes or coordinates that indicate the spatial location of each object within the image. This allows the computer to pinpoint the exact position and extent of the objects it recognizes.
For instance, given an image of a busy street scene, object localization would involve the computer identifying objects like cars, pedestrians, traffic lights, and buildings, and drawing bounding boxes around each of them to show their precise locations within the image.
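A bounding box is usually stored as just four numbers. The sketch below assumes the common (left, top, right, bottom) pixel format and computes "intersection over union" (IoU), a widely used measure of how closely a predicted box matches the true one. The boxes themselves are made-up values.

```python
def area(box):
    """Area of a box given as (left, top, right, bottom)."""
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def iou(box_a, box_b):
    """Overlap area divided by total covered area: 1.0 is a perfect match."""
    overlap = (
        max(box_a[0], box_b[0]),  # left edge of the overlap
        max(box_a[1], box_b[1]),  # top edge
        min(box_a[2], box_b[2]),  # right edge
        min(box_a[3], box_b[3]),  # bottom edge
    )
    intersection = area(overlap)
    union = area(box_a) + area(box_b) - intersection
    return intersection / union if union > 0 else 0.0


true_car = (0, 0, 10, 10)        # where the car really is
predicted_car = (5, 5, 15, 15)   # where the computer thinks it is

print(round(iou(true_car, predicted_car), 3))  # partial overlap
print(iou(true_car, true_car))                 # perfect match
```

A score near 1 means the computer drew its box almost exactly where it belongs, and a score near 0 means it missed.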
Segmentation

The image in Figure 2.17 illustrates three primary segmentation techniques: semantic segmentation, instance segmentation, and panoptic segmentation.[10] These techniques are fundamental in image processing, particularly for applications like autonomous driving, industrial automation, and mining, as discussed in the research article.
Semantic Segmentation
Semantic segmentation involves assigning a specific class label to every pixel in an image. As shown in Figure 2.17b, all objects of the same class share the same color. For example, all cars are marked with the same blue color, and all pedestrians with pink, regardless of individual differences. This method is useful when the goal is to understand the overall structure of the scene, such as differentiating between roads, buildings, and vehicles. However, it does not distinguish between different instances of the same object type.
In the mining industry, semantic segmentation could classify entire regions of an underground scene as “rock,” “machinery,” or “workers,” providing a high-level understanding of the environment with no need to track individual items.
Instance Segmentation
Instance segmentation, depicted in Figure 2.17c, extends semantic segmentation by not only classifying each pixel but also identifying distinct instances of each object category. For instance, instead of simply labeling all cars as “car,” it distinguishes between Car 1, Car 2, and so on, with unique colors or bounding boxes for each. This level of detail is critical in scenarios where it is necessary to track or manipulate individual objects, such as detecting specific equipment in a mining operation or recognizing individual pedestrians for safety in autonomous driving.
For mining, instance segmentation can help identify specific machines or safety equipment, allowing for better inventory management and enhanced safety protocols by tracking individual items.
Panoptic Segmentation
Panoptic segmentation, shown in Figure 2.17d, combines the best of semantic and instance segmentation. It classifies each pixel while also distinguishing between separate instances of each object class. In the image, this method not only categorizes general elements of the scene but also clearly distinguishes individual objects. Panoptic segmentation offers a comprehensive approach, making it valuable for complex environments where both the layout and individual elements need to be understood simultaneously.
In the mining case study, panoptic segmentation could offer a detailed analysis of both the operational environment (e.g., identifying tunnel walls and floors) and specific objects (e.g., distinguishing between different pieces of equipment or workers).
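The three techniques differ mainly in what gets stored for each pixel, which a toy example can show. The sketch below starts with a tiny semantic map (one class label per pixel), groups touching pixels of the same class into numbered instances, and then pairs the two into a panoptic map. Grouping by touching pixels is a simplifying assumption for illustration; real instance segmentation models predict instances directly and can separate objects that touch.

```python
# Semantic segmentation: every pixel gets a class label.
semantic = [
    ["road", "car",  "car",  "road", "car"],
    ["road", "car",  "car",  "road", "car"],
    ["road", "road", "road", "road", "road"],
]
COUNTABLE = {"car"}  # classes whose individual objects we want to tell apart


def label_instances(mask, countable):
    """Give each connected group of same-class pixels its own ID (0 = none)."""
    rows, cols = len(mask), len(mask[0])
    ids = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] in countable and ids[r][c] == 0:
                next_id += 1
                ids[r][c] = next_id
                stack = [(r, c)]
                while stack:  # spread the ID to touching pixels of the same class
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and ids[ny][nx] == 0 and mask[ny][nx] == mask[r][c]):
                            ids[ny][nx] = next_id
                            stack.append((ny, nx))
    return ids


# Instance segmentation: which car is which?
instances = label_instances(semantic, COUNTABLE)

# Panoptic segmentation: class label and instance ID for every pixel.
panoptic = [list(zip(class_row, id_row)) for class_row, id_row in zip(semantic, instances)]

for row in panoptic:
    print(row)
```

In the output, both cars share the label "car" but carry different ID numbers, while every road pixel keeps an ID of 0 because individual stretches of road do not need to be counted.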
Object Detection
Object detection is another crucial aspect of computer vision that builds upon the foundations of image recognition. While image recognition focuses on identifying and classifying the contents of an image, object detection takes it a step further by localizing and detecting multiple objects within an image or video stream in real time.
Imagine you’re walking down a busy street, and your eyes effortlessly detect and track various objects around you, such as cars, pedestrians, traffic signs, and buildings. Object detection aims to replicate this human-like ability in computers, enabling them to detect, localize, and classify objects in complex scenes.
The goal of object detection is to identify objects present in an image and determine their precise locations and spatial extents. It’s like having a computer that can not only tell you, “There’s a car, a person, and a traffic light in this image,” but also draw bounding boxes around each object to show you exactly where they are located.
Object detection has many applications, ranging from autonomous driving and surveillance systems to robotics and augmented reality. It enables computers to understand and interact with the dynamic visual world in real time, making decisions based on the detected objects and their locations.
Several techniques and algorithms have been developed for object detection, each with its own strengths and tradeoffs.
Region-Based CNNs (R-CNNs)
R-CNNs are a family of object detection algorithms that combine the power of CNNs with region proposal methods. Think of R-CNNs as a detective team. First, the team identifies potential locations (regions) where clues (objects) might be found. Then, each region is thoroughly examined using advanced tools (CNNs) to identify the specific clues and refine their locations.
The original R-CNN algorithm follows a multi-stage pipeline. First, it uses a region proposal method, such as selective search, to generate a large number of potential object regions, similar to a detective scanning a city for suspicious areas. Then, it extracts features from each region using a pre-trained CNN and classifies the objects using a set of class-specific linear classifiers, much like sending samples to a lab for detailed analysis. Finally, it refines the bounding boxes using regression techniques to pinpoint the exact location of the objects.
Later versions of R-CNNs, like Fast R-CNN and Faster R-CNN, have built on the original algorithm by adding region of interest pooling and region proposal networks (RPN) to make the detection process faster and more accurate. Think of these improvements as giving the detective team faster cars and better lab equipment, allowing them to solve cases more quickly and accurately.
You Only Look Once (YOLO)
YOLO is a real-time object detection algorithm that uses a different approach than R-CNNs. Instead of using a region proposal step, YOLO treats object detection as a single regression problem (see sidebar), directly predicting bounding boxes and class probabilities from full images in one evaluation. Imagine you’re playing a fast-paced game where you have to identify as many objects as possible in a single glance. That’s how YOLO works, quickly scanning the entire image and making predictions in one go.
Single Shot MultiBox Detector (SSD)
SSD is another popular object detection algorithm that aims to achieve a balance between speed and accuracy. Like YOLO, SSD treats object detection as a regression problem, eliminating the need for a separate region proposal step. Imagine you’re a lifeguard scanning a crowded beach. Instead of focusing on specific areas one at a time, you quickly glance over the entire beach, ready to identify and respond to any issues.
The main idea behind SSD is to use a set of default bounding boxes of different sizes and aspect ratios at various locations in the image. The network predicts the presence of objects and their class probabilities for each default box. Additionally, SSD uses feature maps from different convolutional layers to handle objects at different scales. Think of these feature maps as different lenses that help you see both close-up details and distant objects clearly.
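To picture what "default boxes" look like, the sketch below lays a grid over an image and places several box shapes at the center of every grid cell. The grid sizes, scales, and aspect ratios are hypothetical values chosen for illustration, and the function name is invented. Coordinates run from 0 to 1 across the image.

```python
import math


def default_boxes(grid_size, scale, aspect_ratios):
    """Return (center_x, center_y, width, height) boxes for every grid cell."""
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            center_x = (col + 0.5) / grid_size
            center_y = (row + 0.5) / grid_size
            for ratio in aspect_ratios:
                width = scale * math.sqrt(ratio)   # ratio > 1 gives a wide box
                height = scale / math.sqrt(ratio)  # ratio < 1 gives a tall box
                boxes.append((center_x, center_y, width, height))
    return boxes


ratios = [1.0, 2.0, 0.5]  # square, wide, and tall shapes

# A coarse grid with big boxes looks for large objects...
coarse = default_boxes(grid_size=2, scale=0.6, aspect_ratios=ratios)
# ...while a fine grid with small boxes looks for small ones.
fine = default_boxes(grid_size=4, scale=0.2, aspect_ratios=ratios)

print(len(coarse), len(fine))
print(coarse[0])
```

The network then scores every one of these boxes at once, which is how a detector of this kind avoids a separate region proposal step.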
SSD is known for its good balance between speed and accuracy. It can achieve real-time performance while maintaining competitive accuracy compared to other object detection methods. SSD has been widely used in applications like face detection, pedestrian detection, and traffic sign recognition, much like a lifeguard who can quickly and accurately spot swimmers in need of help.
Facial Recognition
Facial recognition is a specialized application of computer vision that focuses on detecting, analyzing, and identifying human faces in images or video streams. It’s like giving computers the ability to recognize and remember people’s faces, just like we do in our daily lives.
Imagine you’re at a crowded party, and you can easily spot your friends among the sea of faces. You can recognize them even if they’re wearing different clothes or have changed their hairstyle since the last time you saw them. Facial recognition aims to replicate this human-like ability in computers, enabling them to detect, verify, and identify individuals based on their facial features.
Facial recognition has become increasingly important in various domains, from security systems and law enforcement to social media tagging and personalized marketing. It relies on advanced algorithms and techniques to analyze the unique characteristics of human faces, such as the distance between the eyes, the shape of the nose, and the contours of the jawline.
Facial recognition involves several key tasks: face detection, face verification, and face identification.
Face Detection
Face detection is the first step in facial recognition, where the goal is to locate and identify the presence of human faces in an image or video frame. It’s like asking the computer, “Are there any faces in this picture, and if so, where are they located?”
Face detection algorithms scan the image and search for specific patterns and features that are characteristic of human faces, such as the arrangement of eyes, nose, and mouth. Once a face is detected, the algorithm typically outputs the coordinates of a bounding box that encapsulates the face region.
Before November 2021, Meta had facial recognition software built into the Facebook platform. When individuals uploaded pictures, the system scanned faces and tagged individuals within the pictures. In fact, Facebook boasted a 97.25% correct identification score (humans score 97.53%).[11] Because of safety and privacy concerns, the facial recognition software was disabled in 2021.[12]

Face Verification
Face verification, also known as face authentication, is the task of determining whether two face images belong to the same person. It’s like asking the computer, “Are these two faces of the same individual?”
In face verification, the algorithm compares two face images and outputs a similarity score or a binary decision (same person or different persons). This is typically done by extracting discriminative features from each face image and comparing them using a similarity metric or a trained classifier.
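The comparison step can be sketched in a few lines. Here each face has already been reduced to a short list of numbers (often called an embedding), and two faces count as the same person when their similarity clears a threshold. The three-number embeddings and the 0.8 threshold are invented for illustration; real systems use hundreds of numbers per face and carefully tuned thresholds.

```python
import math


def cosine_similarity(a, b):
    """Similarity between two lists of numbers: 1.0 means pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


def same_person(face_a, face_b, threshold=0.8):
    """Binary decision: do the two faces look alike enough?"""
    return cosine_similarity(face_a, face_b) >= threshold


enrolled = [0.90, 0.10, 0.40]     # stored when the device was set up
owner_today = [0.85, 0.15, 0.38]  # same person, slightly different photo
stranger = [0.10, 0.90, 0.20]     # someone else entirely

print(same_person(enrolled, owner_today))  # device unlocks
print(same_person(enrolled, stranger))     # device stays locked
```

Raising the threshold makes the system stricter (fewer impostors get in, but the owner is rejected more often), and lowering it does the reverse.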
Face verification is commonly used in access control systems, where a person’s face is compared against a pre-enrolled face template to grant or deny access. It’s also used in applications like phone unlocking, where the user’s face is verified to authenticate their identity.
When I bought my first Microsoft Surface, part of the onboarding process involved taking a picture of myself, which was used to unlock the device from then on. That was my first experience using facial recognition to unlock one of my devices. Today, I use face verification on many different tools, from my computer to my smartphone. We’ve become increasingly used to face verification as a security tool.
The Transportation Security Administration (TSA) started rolling out face verification stations at many airports around the United States in 2024. Credential Authentication Technology (CAT-2) units feature advanced digital ID readers and a built-in camera that captures a real-time photo of the traveler. AI algorithms then compare the image on the ID with the person standing there. The Office of Inspector General at the US Department of Homeland Security initiated a study at the request of a bipartisan group of senators in November 2024. The goal of the study is to evaluate the use of identity authentication and biometric technology by the TSA to ensure that appropriate safeguards are in place to protect passenger privacy.[13]
Face Identification
Face identification is the task of identifying a person’s identity based on their face image. It’s like asking the computer, “Who is this person in the picture?”
In face identification, the algorithm compares a given face image against a database of known individuals and outputs the identity of the best match. This is typically done by extracting features from the input face image and comparing them against the features of the enrolled faces in the database.
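Where verification compares one face to one stored template, identification searches an entire database. The sketch below finds the closest enrolled face and reports "no match" when even the closest one is too far away. The names, the three-number face representations, and the distance cutoff are all hypothetical.

```python
import math

# Enrolled faces, each reduced to a short list of numbers (invented values).
database = {
    "alice": [0.90, 0.10, 0.40],
    "bob": [0.10, 0.90, 0.20],
}


def identify(query, database, max_distance=0.3):
    """Return the name of the closest enrolled face, or None if nobody is close."""
    best_name, best_distance = None, float("inf")
    for name, enrolled in database.items():
        distance = math.dist(query, enrolled)  # straight-line distance
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance <= max_distance else None


print(identify([0.85, 0.15, 0.38], database))  # close to alice's enrolled face
print(identify([0.50, 0.50, 0.90], database))  # not close to anyone
```

The cutoff matters: without it, the system would always name its best guess, even for someone who was never enrolled.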
Face identification has various applications, such as in law enforcement for identifying suspects or missing persons, in social media for automatically tagging friends in photos, and in customer service for personalized experiences.
Face identification, although amazing technology, does have a dark side. In 2022, a personal injury lawyer from New Jersey named Kelly Conlon was escorting her daughter’s Girl Scout troop to see the Rockettes’ “Christmas Spectacular” at Radio City Music Hall. Before she knew what was happening, security guards were escorting her from the building.[14] She found out that Radio City Music Hall’s parent company, Madison Square Garden (MSG) Entertainment, had placed her on an “attorney exclusion list.” MSG Entertainment had banned attorneys involved in litigation against the company, along with all employees of any law firm suing it. Even though Conlon was not actively involved in litigation against MSG Entertainment, her law firm had a client who was suing the company. Of course, the question becomes, how did they even know who Kelly Conlon was? If you’re guessing face identification, you guessed correctly. MSG Entertainment had created a database of people who were not allowed in any of its venues and used face identification software to scan the face of everyone who entered. When the software matched Conlon’s face to a member of a law firm currently in litigation against MSG Entertainment, security escorted her from the building. As of spring 2024, there is a class action lawsuit against MSG Entertainment brought by various individuals who have been barred from MSG Entertainment venues as a result of face identification technology.
Sidebar – Face Recognition and Bias
Facial detection software has advanced rapidly in recent years, but it is essential to be aware of the potential biases that can emerge. Research has revealed that certain facial recognition algorithms have higher error rates when identifying individuals with darker skin tones.[15] This disparity can lead to significant consequences, such as the wrongful identification of innocent people as criminal suspects.
The biased performance of these systems raises serious questions about fairness and highlights the need for thorough evaluation before widespread implementation. It is crucial to ensure that this technology operates equitably for all individuals, regardless of their demographic background.
Scientists and researchers are diligently working to mitigate these biases, but significant work remains.[16] Developers must use diverse datasets during the training process and conduct extensive testing to ensure the algorithms’ fairness. Only by directly addressing and resolving these issues can society truly benefit from facial detection software while minimizing the risk of reinforcing existing societal biases.
The path to eliminating bias in facial detection technology is challenging. Still, it is a necessary endeavor if we aim to create a system that serves all of humanity fairly and effectively. It requires a collaborative effort from researchers, developers, policymakers, and society to ensure that principles of accountability, fairness, and transparency guide the development and deployment of this technology.
Key Takeaways
- Computer vision is a field of AI that enables machines to interpret and understand visual information from the world.
- Computer vision systems rely on key components, including sensors, data preprocessing, and algorithms for image analysis.
- Techniques like image classification, object detection, and image segmentation help computers analyze and interpret visual data.
- Computer vision is applied in various fields, including facial recognition, autonomous vehicles, and medical imaging.
- Computer vision systems face challenges such as occlusion, varying lighting conditions, and processing large volumes of visual data.
Exercises
- Develop an image classification model to categorize different types of images.
- Implement an object detection algorithm to identify and label objects in images.
- Research and present on a specific application of computer vision, discussing its technology and impact.
Introduction to Robotics
Learning Objectives
- Explain what robotics is and its key components.
- Describe various types of robots and their applications.
- Discuss how AI technologies are integrated into robotics.
- Identify the current challenges in robotics.

I don’t know about you, but I (Jason) have been waiting for my personal home robot since I saw Steven Spielberg’s 2001 film A.I. Artificial Intelligence. The idea of having a robot that helps around the house and mows the yard sounds pretty darn amazing to me. My first robot was a Sony Aibo knock-off back in the early 2000s. The Aibo was a robotic dog that barked, did tricks, and got a little mad at you when you didn’t pay it enough attention (glowing red eyes). The Aibo (and its cheaper cousins) had sensors on the head and body, so it knew when you took the time to pet it or just left it alone. It was a fun toy and great for apartment living where I didn’t have a real dog. Admittedly, once I got a real dog, the fun of the fake Aibo kind of went out the window.
Over the past few years, we’ve seen a lot of major breakthroughs in robotics. It seems like every week now a new robot is being unveiled on social media. Although my dream of a personal home robot to help with the chores isn’t here yet, that future definitely seems like it’s just around the corner. Not that long ago, the prediction was that humanoid robots would be in our homes by 2050,[17] but many tech insiders have already moved that estimate up to 2030.[18]
Robotic Hardware

When we think of robots, we often picture the physical machines that move and interact with the world around them. But have you ever wondered what makes these robots tick? It all comes down to their hardware—the tangible components that enable them to sense, move, and perform various tasks. In this section, we’ll explore three key aspects of robotic hardware: sensors, actuators, and locomotion systems.
Sensors
Sensors are essentially a robot’s eyes and ears. Imagine you’re walking through a crowded street. You use your eyes to navigate, avoiding obstacles and pedestrians. You hear the sounds of cars honking and people chatting. Similarly, robots use sensors to gather information about their surroundings.
Robots can employ various types of sensors. For example, cameras act as their eyes, allowing them to perceive visual information. Microphones serve as their ears, enabling them to detect and interpret sounds. Other sensors, like infrared or ultrasonic sensors, help robots measure distances and detect nearby objects, much like bats use echolocation to navigate in the dark.
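To make the echolocation comparison concrete, here is a minimal sketch of how an ultrasonic sensor reading can be turned into a distance. The sensor emits a sound pulse and times the echo; because the pulse travels to the object and back, the distance is half of speed times time. The function names and the 0.5-meter threshold are assumptions chosen for illustration, and the speed of sound is an approximate value for air at room temperature.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value for dry air at about 20 °C

def echo_time_to_distance(echo_time_s, speed=SPEED_OF_SOUND_M_PER_S):
    """Convert a round-trip echo time (seconds) into a one-way distance (meters)."""
    if echo_time_s < 0:
        raise ValueError("echo time cannot be negative")
    # The pulse travels out and back, so divide the round trip by two.
    return speed * echo_time_s / 2.0

def is_obstacle_near(echo_time_s, threshold_m=0.5):
    """Return True if the detected object is closer than the (hypothetical) threshold."""
    return echo_time_to_distance(echo_time_s) < threshold_m

# An echo that returns after 10 milliseconds came from roughly 1.7 meters away.
print(round(echo_time_to_distance(0.010), 3))
print(is_obstacle_near(0.002))
```

A real robot would usually average several readings, since single echoes can be noisy.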
Actuators
Actuators are akin to muscles for robots. Human muscles enable us to move and interact with objects; actuators are the components that allow robots to perform physical actions. Actuators convert energy, typically electrical energy, into motion.
There are three main types, each powered differently and suited to different tasks:
- Electric: They use electric motors and are precise and quick, perfect for tasks needing speed and accuracy, like in assembly lines or medical surgeries.
- Hydraulic: These use fluid pressure to create strong movements, making them ideal for heavy lifting or construction robots.
- Pneumatic: Powered by compressed air, they are simple, lightweight, and cost-effective, making them ideal for repetitive tasks like packaging.
The kind of actuator chosen affects a robot’s performance significantly. Advanced robotics might combine actuators with sensors for smarter, more precise movements. It’s also important to mention that materials used in actuators (e.g., metals, plastics, or composites) impact their effectiveness and durability.
Locomotion Systems
When most people think about robots, they probably think of a robot on two legs like those portrayed in movies like Bicentennial Man, Chappie, I, Robot, The Terminator, and Transformers. Most of the robots in these movies mimic humans and walk on two legs. Today, however, there is a range of different locomotion systems available to robot engineers. Locomotion systems are the components that enable robots to move from one place to another. The choice of locomotion system depends on the robot’s intended environment and purpose.
Walking Robots
Walking robots use legs to move, mimicking the gait of humans or animals. They can navigate uneven terrain and climb stairs, making them suitable for exploration and search-and-rescue missions. Examples include humanoid robots and multi-legged robots like quadrupeds and hexapods. One of the biggest names in robotics, Boston Dynamics, has been working on both bipedal robots and robotic “dogs” over the years and has really made a name for itself. In 2023 and 2024, we’ve seen a lot of growth in this type of robotics from:
- Apptronik’s Apollo
- Boston Dynamics’ new electric Atlas
- Figure AI’s Figure 01 (developed in partnership with OpenAI)
- Sanctuary AI’s Phoenix
- Tesla’s Optimus
- Unitree’s H1
Rolling Robots

Rolling robots use wheels or tracks to move across surfaces. They are efficient and easy to control on flat, smooth terrain. Examples include wheeled robots like Roomba vacuum cleaners and tracked robots used in industrial settings. Many humanoid robots use wheels instead of legs to get around. This is the classic Rosie the Robot form from the 1960s cartoon The Jetsons. One of the first commercially available humanoid robots, SoftBank Robotics’ Pepper, moves around on wheels.
Hopping Robots
Hopping robots use a spring-loaded mechanism to jump and bounce. By adjusting the force and direction of their hops, they can overcome obstacles and navigate rough terrain. One example is Salto, a jumping robot developed at the University of California, Berkeley.[19]
Metachronal Robots
Some multi-legged robots use metachronal motion, a form of locomotion that takes its cues from the movements of centipedes and millipedes. The legs move in a wave-like pattern, with each leg moving slightly after the one before it. This type of motion allows the robot to maintain stability and navigate uneven surfaces. EPFL developed an intriguing early prototype of this type of motion to resemble how salamanders move, which would be a great way for robots to enter and exit water.
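The wave-like pattern can be sketched with a little arithmetic: give every leg the same step cycle, but delay each leg by a fixed fraction of that cycle relative to the leg in front of it. The code below is an illustrative sketch only. The function names, the one-second step period, the one-eighth-cycle lag, and the rule that a leg is lifted for the first quarter of its cycle are all assumptions, not values taken from any particular robot.

```python
def leg_phases(num_legs, t, period_s=1.0, lag_fraction=0.125):
    """Return each leg's position in its step cycle (0.0 to 1.0) at time t.

    Leg 0 is the front leg; every later leg lags the previous one
    by `lag_fraction` of a full cycle, which produces the traveling wave.
    """
    return [((t / period_s) - i * lag_fraction) % 1.0 for i in range(num_legs)]

def lifted_legs(num_legs, t, swing_fraction=0.25, **kwargs):
    """Return True for each leg that is off the ground at time t.

    Assumption: a leg swings forward during the first `swing_fraction`
    of its cycle and pushes against the ground for the rest.
    """
    return [phase < swing_fraction for phase in leg_phases(num_legs, t, **kwargs)]

# Halfway through the front leg's cycle, the wave has reached the fourth leg.
print(leg_phases(4, 0.5))
print(lifted_legs(4, 0.5))
```

Because only a few legs are lifted at any moment, most of the legs stay on the ground, which is what gives this style of movement its stability.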
Slithering Robots
Slithering robots mimic the movement of snakes, using undulating motions to propel themselves forward. They can navigate through narrow spaces and uneven terrain, making them useful for inspection and search-and-rescue applications. Examples include snake-like robots developed by various research institutions. NASA’s Jet Propulsion Laboratory has created a slithering-type robot named EELS that could be great for exploring other planets.
Swimming Robots
Swimming robots are designed to move through water using propulsion systems like propellers, fins, or jets. They can be used for underwater exploration, monitoring, and maintenance tasks. Examples include autonomous underwater vehicles (AUVs) and robotic fish.
Brachiating Robots
The movement of primates, particularly gibbons, who swing from branch to branch using their arms, serves as inspiration for brachiating robots. These robots use grippers or hooks to swing from one handhold to another, allowing them to traverse complex, three-dimensional environments. Brachiating robots are still largely in the research phase, with potential applications in forestry, construction, and space exploration. Imagine a monkey swinging through the trees from one arm to the next as they propel themselves forward. This is kind of what a brachiating robot looks like. The German Research Center for Artificial Intelligence has created a robot named RicMonk that can go across monkey bar-like structures with ease.
Aerial Robots
Aerial robots, commonly known as drones, utilize aviation technologies to achieve flight. These robots leverage rotors, propellers, or jet propulsion to move through the air. Their ability to hover, ascend, and travel quickly across distances offers significant advantages for tasks such as aerial surveillance, delivery services, and environmental monitoring.
Consumer drones, like those from DJI or Autel Robotics, have popularized aerial photography. More specialized drones, like those from Zipline, deliver medical supplies in remote areas, or maybe dinner to your front door. Aerial robots also play critical roles in scientific research and disaster response, where their high mobility and ability to access hard-to-reach areas are invaluable. Their development continues to push the boundaries of what is possible in robotics and aviation.
Hybrid
Hybrid robots combine multiple locomotion systems to create versatile machines that can adapt to various environments. For example, a robot might have legs for walking on uneven terrain and wheels for efficient movement on flat surfaces. Hybrid robots can also combine different types of locomotion, such as swimming and crawling, to navigate through complex environments like underwater caves or pipelines. By leveraging the strengths of different locomotion systems, hybrid robots can tackle a wide range of tasks and challenges.
Robotic Software

Have you ever played a video game where you controlled a character’s movements and actions? Just like how games have underlying code that governs the character’s behavior, robots rely on software to operate effectively. Robotic software is the brain behind the machine, telling it how to interpret sensor data, plan its movements, and execute tasks. In this section, we’ll explore three crucial aspects of robotic software.
Robot Operating System
Just like your home computer needs an operating system to function (e.g., Windows, macOS, or Ubuntu), robots require complex system software to function. Robotic operating systems play a pivotal role in the world of robotics by providing the tools and libraries to develop complex and robust robotic applications. These specialized systems handle the concurrent processes that operate across a wide range of robotic hardware. This includes managing sensor data, controlling actuators, and enabling autonomous decision-making.
ROS (Robot Operating System) stands out as a prime example. Despite its name, ROS runs on top of a conventional operating system such as Linux. It’s an open-source platform that empowers researchers and developers by offering tools and community support to foster the development of advanced robotic applications. Its widespread adoption and active community have made it a cornerstone in the progression of robotic technology.
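ROS programs are organized as small, separate pieces that pass messages to each other over named channels called topics. The sketch below imitates that publish-and-subscribe idea in plain Python so it can run anywhere. It does not use the actual ROS libraries, and the `MessageBus` class, the topic names, and the 0.5-meter stopping rule are all hypothetical.

```python
from collections import defaultdict

class MessageBus:
    """A toy stand-in for topic-based messaging (not the real ROS API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a function to be called whenever `topic` gets a message."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver `message` to every function subscribed to `topic`."""
        for callback in self._subscribers[topic]:
            callback(message)

bus = MessageBus()
motor_commands = []  # records what the "motor" was told to do

def avoid_obstacles(distance_m):
    # Hypothetical rule: stop when something is closer than half a meter.
    command = "stop" if distance_m < 0.5 else "forward"
    bus.publish("motor/command", command)

# The decision logic listens to the sensor; the motor listens to the decisions.
bus.subscribe("sensor/distance", avoid_obstacles)
bus.subscribe("motor/command", motor_commands.append)

for reading in (2.0, 1.1, 0.3):
    bus.publish("sensor/distance", reading)

print(motor_commands)
```

The appeal of this design is that the sensor code, the decision code, and the motor code never call each other directly, so any one of them can be swapped out without rewriting the others.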
Motion Planning
When you want to reach for a cup of coffee on your desk, your brain automatically plans the path your arm should take to grasp the cup without knocking over other objects. Robots also need to plan their movements to avoid collisions and achieve their goals efficiently. This is where motion planning comes into play.
Motion planning algorithms help robots determine the best route to get from point A to point B. These algorithms consider factors like the robot’s size, the location of obstacles, and the desired destination. They create a map of the environment and generate a path that the robot can follow, much like how you might use a GPS navigation system to find the shortest route to your destination.
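As a rough illustration of the idea, the sketch below treats the environment as a small grid map and uses breadth-first search, one of the simplest path-finding methods, to find a shortest route around obstacles. Real motion planners are far more sophisticated and also account for the robot’s size and shape. The function name `plan_path` and the sample map are made up for this example.

```python
from collections import deque

def plan_path(grid, start, goal):
    """Find a shortest path on a grid using breadth-first search.

    `grid` is a list of rows where 0 means free space and 1 means obstacle.
    `start` and `goal` are (row, column) tuples. Returns a list of cells
    from start to goal, or None if the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}  # remembers how each cell was reached
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk backward from the goal to rebuild the route.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and (nr, nc) not in came_from:
                came_from[(nr, nc)] = current
                queue.append((nr, nc))
    return None

# A made-up map: the robot must squeeze through the gap in the second row.
grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
path = plan_path(grid, (0, 0), (3, 3))
print(path)
```

Breadth-first search explores the map one step at a time in every direction, so the first route that reaches the goal is guaranteed to be among the shortest on this kind of grid.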
Control Systems
Human brains coordinate a wide range of complex signals so we can have fine motor skills. Whether you’re drawing a picture, typing a book report, or simply picking up an object, your brain sends signals to your extremities telling them how to behave. In fact, most of the time, we don’t even think about how this works unless we have a hard time doing something. Let’s look at a real-world example. Imagine it’s breakfast and you want to make scrambled eggs. You reach into the refrigerator and pull out a carton of eggs. You open the carton and select one of the eggs. Do you consciously think about how much pressure you need to put on the egg when lifting it out of the container? Do you consciously tell yourself that you can’t use the same amount of pressure when picking up an egg as you would when shaking someone’s hand? Of course not. Our brains know that eggs are fragile, so we automatically adjust how we pick up an egg based on prior learning. Robots face a similar challenge when executing movements. They need to maintain precise control over their actuators to achieve the desired motion, so control systems help them determine how much physical pressure is necessary when lifting an egg out of a carton or lifting a heavy, less fragile object.
Control systems ensure that robots move smoothly and accurately. They continuously monitor the robot’s position and compare it to the desired trajectory. If there are any deviations, the control system adjusts the actuator inputs to correct the robot’s movement, much like how a thermostat regulates the temperature in your home by turning the heating or cooling on and off.
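The monitor, compare, and correct cycle can be sketched in a few lines. The example below uses a proportional controller, one of the simplest kinds of feedback control: at each step it measures the error (the gap between where the robot is and where it should be) and moves by a fixed fraction of that gap. The function names and the gain of 0.5 are assumptions for illustration, and the sketch pretends the actuator moves exactly as commanded, which real hardware does not.

```python
def proportional_step(position, target, gain=0.5):
    """Apply one feedback correction and return the new position."""
    error = target - position          # how far off are we?
    return position + gain * error     # correct a fraction of the error

def run_controller(start, target, steps, gain=0.5):
    """Run the feedback loop for a number of steps and record each position."""
    positions = [start]
    for _ in range(steps):
        positions.append(proportional_step(positions[-1], target, gain))
    return positions

# Move a joint from position 0 toward position 10 in five correction steps.
trajectory = run_controller(0.0, 10.0, 5)
print(trajectory)  # [0.0, 5.0, 7.5, 8.75, 9.375, 9.6875]
```

Notice that each correction is smaller than the last, so the robot eases into the target instead of overshooting it. A thermostat that only switches fully on or off is a cruder version of the same feedback idea.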
Types of Robots
Robots come in all shapes and sizes, each designed to perform specific tasks and navigate different environments. From the robotic vacuum cleaner that keeps your floors spotless to the industrial robot that assembles cars with precision, there’s a wide variety of robots out there. In this section, we’ll explore six common types of robots: autonomous mobile robots, automated guided vehicles, articulated robots, humanoids, cobots, and hybrids.
Autonomous Mobile Robots (AMRs)
Imagine a robot that can independently navigate through a warehouse, picking up and delivering packages without any human intervention. That’s the role of AMRs. They are equipped with an array of sensors and advanced software to create dynamic maps of their surroundings and calculate the most efficient routes to their destinations.
AMRs are akin to the intrepid explorers of the robot world. They autonomously navigate through changing environments, skillfully avoiding obstacles and adapting on the fly. Much like seasoned explorers adept at charting unknown territories, AMRs efficiently traverse complex spaces like warehouses, factories, and even outdoor settings without getting lost or colliding with unexpected obstacles.
Recent advancements include robots like Locus Robotics’ LocusBot, which optimizes picking processes in warehouses, and Boston Dynamics’ Stretch, designed to handle package moving and management in logistics centers. Because they are more efficient and require less human work, these robots are revolutionizing a variety of industries that require repetitive tasks.
Automated Guided Vehicles (AGVs)
Conversely, AGVs operate more like obedient pets, adhering strictly to predefined paths. These robots navigate using guidance systems, such as magnetic strips, QR codes, or lasers, reflecting their dependency on installed guides to maneuver through their environment.
AGVs are the workhorses of structured environments like manufacturing plants and large-scale distribution centers. They excel at performing repetitive, high-volume tasks with great reliability. A classic example is Toyota’s range of AGVs, which provide material transport solutions across its manufacturing plants, showcasing their capability to streamline logistics and supply chain operations. America in Motion has a series of AGVs that fall into basic categories: forklifts, clamps, loaders, lifts, and draggers.
Articulated Robots
Articulated robots are invaluable in manufacturing and assembly. Equipped with multiple rotating joints, these robots can perform complex tasks requiring high levels of precision and flexibility. They can weld, paint, assemble, and handle materials, among other tasks.
These robots resemble the acrobats of the robot world, capable of a range of motions that mimic human arms. The most prominent examples include the KUKA KR QUANTEC and the ABB IRB 6700, both of which can perform tasks requiring intricate movements and continuous operation in industrial environments. When most people think of industrial robots, they think of these types. They look like giant robotic arms or a pair of arms that perform specific tasks.
Humanoids

Humanoid robots mirror human form and functionality, designed to interact with tools and environments crafted for human use. They feature bipedal locomotion and can perform tasks ranging from simple household chores to complex interactions involving speech and object recognition.
Prominent examples include Honda’s ASIMO, one of the first advanced humanoids capable of walking and navigating stairs independently, and Boston Dynamics’ Atlas, which can run, jump, and even perform backflips. Humanoid robots are often used in research, education, and increasingly in customer service roles, where they can engage directly with consumers. Most of the walking robots we discussed earlier are humanoid.
Cobots
Collaborative robots, or cobots, work alongside humans in shared spaces without safety cages. These robots are equipped with advanced sensors and software that ensure safe interactions, making them ideal for direct cooperation with human operators.
Universal Robots’ UR series is a leading example, extensively used in many industries for jobs like assembly, painting, and screw driving. These robots are particularly valued for their capacity to collaborate with human laborers, increasing efficiency while maintaining high safety standards.
Hybrids
Hybrid robots merge functionalities from various robot types to handle diverse tasks and adapt to multiple environments. These robots may combine the autonomy of AMRs, the dexterity of articulated robots, and the safety features of cobots, creating highly versatile systems capable of addressing a wide range of industrial challenges.
An example of such innovation is Boston Dynamics’ Spot, which can be fitted with different payloads for tasks like remote monitoring, data collection, or manipulation. This showcases its adaptability across various sectors, including construction, oil and gas, and public safety.
Key Takeaways
- Robotics involves the design, construction, and operation of robots, integrating mechanical, electronic, and software components.
- Robots can be classified into industrial, service, and autonomous robots, each with specific applications in manufacturing, healthcare, logistics, and more.
- AI technologies, such as ML and computer vision, enhance robots’ capabilities in navigation, manipulation, and human-robot interaction.
- Robotics faces challenges in achieving precise navigation, effective manipulation, and safe interaction with humans, requiring continuous advancements in technology.
Exercises
- Design a simple robot with specific functionalities, such as obstacle avoidance or object manipulation.
- Use a robotics simulation software to program and test a robot’s behavior in a virtual environment.
- Research and present on a specific application of robotics, discussing its technology and impact.
Chapter Wrap-Up
In this chapter, we explored the core concepts and capabilities of AI, delving into algorithms, data structures, and programming languages. We evaluated the technical aspects of AI by exploring key subfields such as ML, deep learning, NLP, computer vision, and robotics. We examined the foundational concepts, significant techniques, and practical applications of each subfield. The chapter highlighted how AI technologies are transforming various industries, from healthcare and finance to autonomous systems and language translation. Through understanding these technical insights, we gain a comprehensive view of the current capabilities and future potential of AI, setting the stage for further exploration and innovation in this rapidly growing field.
Key Terms
- Algorithm
- Anomaly Detection
- Artificial Neural Networks (ANNs)
- Autoencoders
- Clustering
- Computer Vision
- Context Window
- Convolutional Neural Networks (CNNs)
- Data Structures
- Deep Learning
- Dimensionality Reduction
- Generative Adversarial Networks (GANs)
- Image Classification
- Machine Learning (ML)
- Named Entity Recognition (NER)
- Natural Language Processing (NLP)
- Object Detection
- Part-of-Speech Tagging (POS Tagging)
- Recurrent Neural Networks (RNNs)
- Reinforcement Learning (RL)
- Robotics
- Sentiment Analysis
- Supervised Learning
- Tokenization
- Transfer Learning
- Unsupervised Learning
Chapter Exercises
- Imagine you are tasked with designing an AI system to assist doctors in diagnosing rare diseases. What type of AI technology would you use (e.g., ML, NLP, computer vision, etc.), and how would you train and evaluate the system to ensure its accuracy and reliability? Discuss the potential benefits and challenges of implementing such a system in a healthcare setting.
- Choose a specific application of generative AI (e.g., text generation, image synthesis, music composition) and research the current state-of-the-art models and techniques used in that domain. Create a presentation or write a report that explains how these models work, showcases some examples of their outputs, and discusses the potential implications and ethical considerations of using generative AI in creative industries.
- Design and conduct a survey or interview study to investigate public perceptions and attitudes toward AI technologies, such as facial recognition, autonomous vehicles, or chatbots. Analyze the results to identify common themes, concerns, and expectations, and propose strategies for addressing these issues through responsible AI development and deployment practices.
- Develop a proposal for an interdisciplinary research project that combines AI with another field of study (e.g., psychology, economics, environmental science) to tackle a specific real-world problem. Outline the project’s goals, methodology, and expected outcomes, and discuss how AI could be used to advance knowledge and create positive impact in the chosen domain.
Real-World Case Study
DeepMind’s AlphaFold and Protein Structure Prediction
Predicting the 3D structure of proteins is a crucial challenge in biology and drug discovery. Google’s DeepMind, a leading AI research company, developed AlphaFold, an AI system that uses deep learning to predict protein structures with unprecedented accuracy.[20]
In the 2020 Critical Assessment of Protein Structure Prediction (CASP14) competition, AlphaFold outperformed all other methods, achieving a median Global Distance Test (GDT) score of 92.4 out of 100. This breakthrough has significant implications for understanding diseases, designing new drugs, and advancing scientific research.
In 2024, DeepMind released the AlphaFold 3 Model, which is set to revolutionize modern medicine. AlphaFold 3 has the ability to accurately predict how drugs and other therapeutic molecules interact with proteins, which is essential for drug discovery.[21]
Questions:
- How can AI revolutionize scientific discovery and accelerate research in fields like biology and chemistry?
- What are the potential applications of accurate protein structure prediction in drug development and personalized medicine?
- How can collaboration between AI researchers and domain experts drive innovation and solve complex real-world problems?
The Falsely Identified Man
Randal “Quran” Reid was traveling to his mother’s house the day after Thanksgiving when Georgia police unexpectedly stopped and detained him. The officers informed him that he was wanted for crimes in Louisiana, a state he had never visited. Quran was taken to jail and spent several days locked up without clear information or a timeline for resolution from the authorities.
The arrest was based on the misuse of facial recognition technology by a sheriff’s detective in Jefferson Parish, Louisiana. Detective Andrew Bartholomew had used surveillance video from a store outside New Orleans where a stolen credit card was used to buy two expensive purses. The facial recognition software identified Randal Reid as the suspect involved in the theft. So, Bartholomew, relying on no other evidence, filed an arrest warrant for Quran.
Quran spent several challenging days behind bars trying to understand the charges and prove his innocence, but was told in Georgia there wasn’t much they could do because the warrant was issued in a different state. It wasn’t until his family hired an attorney in Louisiana, who presented photos and videos proving Quran’s presence in Georgia at the time of the theft, that the mistake was acknowledged. The sheriff’s office eventually requested the judge withdraw the warrant, and Quran was released.
The wrongful arrest had significant repercussions for Reid, not only emotionally but also financially, as he incurred legal expenses, missed work, and faced personal distress during and after the incident. Quran’s unsettling experience highlights serious concerns about the reliability of facial recognition technology and its implications for justice and individual rights.
Discussion Questions:
- Do the benefits of facial recognition in law enforcement outweigh the risks? Consider how factors such as efficiency, public safety, and crime prevention might be enhanced by this technology.
- How might biases present in facial recognition algorithms lead to disproportionate impacts on specific communities? Discuss the potential consequences for fairness and civil liberties, and explore examples where such biases have manifested.
- What types of safeguards and regulations should be implemented to prevent the misuse of facial recognition technology by law enforcement agencies? Consider aspects such as transparency, oversight, and accountability in your discussion.
- How can law enforcement agencies strike a balance between leveraging the benefits of facial recognition for public safety and upholding individual privacy rights? Propose specific guidelines or policies that could help manage this balance responsibly.
- How do incidents of misidentification, like the case of Randal “Quran” Reid, highlight the need for accuracy and reliability in the use of facial recognition technology? Discuss the potential human cost of errors and the impact on trust in law enforcement.
- What are the ethical and legal challenges posed by the use of facial recognition technology in criminal investigations? Explore how these challenges could be addressed to ensure ethical compliance and protect civil rights.
- Consider the impact of facial recognition technology on community-police relations. How might negative experiences with this technology affect public perceptions and trust in law enforcement?
- What future developments in facial recognition technology could help mitigate risks and improve its reliability and fairness? Discuss potential technological advancements or research areas that could contribute to these improvements.
End-of-Chapter Assessment
Discussion Questions:
- How do the various types of machine learning (supervised, unsupervised, reinforcement) differ in their approaches and applications?
- In what ways have deep learning models, such as CNNs and RNNs, revolutionized fields like computer vision and NLP?
- What are the ethical implications of generative AI models, such as GANs, in creating realistic synthetic data?
- How can advancements in NLP improve human-computer interaction and accessibility?
- What challenges do researchers face in developing and deploying autonomous robots in real-world environments?
Multiple Choice Questions
1. What is the primary goal of machine learning?
A) To replace human workers
B) To improve computer performance using data
C) To create physical robots
D) To design video games
2. Which type of learning uses labeled data to train algorithms?
A) Unsupervised learning
B) Reinforcement learning
C) Supervised learning
D) Deep learning
3. What is the key characteristic of deep learning models?
A) They use rule-based systems
B) They are shallow networks
C) They learn hierarchical representations with multiple layers
D) They require no data for training
4. Which architecture is specifically designed for processing sequential data?
A) CNNs
B) RNNs
C) GANs
D) Autoencoders
5. In a GAN, what is the role of the discriminator?
A) To generate synthetic data
B) To encode input data
C) To decode latent representations
D) To distinguish between real and generated data
6. Which technique in NLP involves breaking down text into smaller units?
A) Tokenization
B) POS Tagging
C) NER
D) Sentiment Analysis
7. What is the primary application of image classification in computer vision?
A) Translating text
B) Categorizing images into predefined classes
C) Detecting objects within images
D) Generating synthetic images
8. Which deep learning model is used for unsupervised learning and representation learning?
A) CNNs
B) RNNs
C) GANs
D) Autoencoders
9. What is the main goal of reinforcement learning?
A) To classify data into categories
B) To discover patterns in unlabeled data
C) To learn through trial and error interactions with the environment
D) To process and understand human language
10. Which field involves the design, construction, and operation of robots?
A) Natural Language Processing
B) Computer Vision
C) Robotics
D) Machine Learning
True or False Questions
- Machine learning can improve computer performance without explicit programming.
- Supervised learning uses unlabeled data to train algorithms.
- Deep learning models typically require small datasets to achieve high performance.
- CNNs are designed specifically for processing textual data.
- GANs consist of two neural networks: a generator and a discriminator.
- Tokenization is a technique used in computer vision to detect objects in images.
- Image classification involves categorizing images into predefined classes.
- Autoencoders are used for unsupervised learning and representation learning.
- Reinforcement learning involves learning through trial and error interactions with the environment.
- Robotics does not integrate AI technologies like machine learning and computer vision.
Answer Key
Discussion Questions
1. How do the various types of machine learning (supervised, unsupervised, reinforcement) differ in their approaches and applications?
Example Answer: Supervised learning uses labeled data to train algorithms to map inputs to outputs, useful in tasks like classification and regression. Unsupervised learning works with unlabeled data to discover hidden patterns, useful in clustering and dimensionality reduction. RL involves an agent learning to make decisions through trial and error by interacting with an environment, receiving rewards or penalties, used in robotics and game playing.
2. In what ways have deep learning models, such as CNNs and RNNs, revolutionized fields like computer vision and NLP?
Example Answer: CNNs have transformed computer vision by enabling accurate image classification, object detection, and segmentation. RNNs have revolutionized NLP by allowing models to understand and generate sequential data, improving tasks like language translation, sentiment analysis, and speech recognition.
3. What are the ethical implications of generative AI models, such as GANs, in creating realistic synthetic data?
Example Answer: Ethical implications include the potential for misuse in creating deepfakes, the challenge of distinguishing between real and synthetic data, issues of privacy and consent, and the need for regulations to prevent malicious use while promoting beneficial applications.
4. How can advancements in NLP improve human-computer interaction and accessibility?
Example Answer: NLP advancements can improve virtual assistants’ ability to understand and respond to natural language, enhance accessibility tools for individuals with disabilities through better speech recognition and text-to-speech systems, and facilitate more intuitive and efficient human-computer interactions.
5. What challenges do researchers face in developing and deploying autonomous robots in real-world environments?
Example Answer: Challenges include ensuring reliable and safe navigation in dynamic environments, achieving precise manipulation and interaction with objects, dealing with unpredictable human behaviors, and integrating advanced AI technologies for real-time decision-making and adaptation.
Multiple Choice Questions
1. What is the primary goal of machine learning?
Answer: B. To improve computer performance using data.
2. Which type of learning uses labeled data to train algorithms?
Answer: C. Supervised learning
3. What is the key characteristic of deep learning models?
Answer: C. They learn hierarchical representations with multiple layers
4. Which architecture is specifically designed for processing sequential data?
Answer: B. RNNs
5. In a GAN, what is the role of the discriminator?
Answer: D. To distinguish between real and generated data
6. Which technique in NLP involves breaking down text into smaller units?
Answer: A. Tokenization
7. What is the primary application of image classification in computer vision?
Answer: B. Categorizing images into predefined classes
8. Which deep learning model is used for unsupervised learning and representation learning?
Answer: D. Autoencoders
9. What is the main goal of reinforcement learning?
Answer: C. To learn through trial and error interactions with the environment
10. Which field involves the design, construction, and operation of robots?
Answer: C. Robotics
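Answer 6 above names tokenization as the technique that breaks text into smaller units. As a rough sketch of what that can look like, the snippet below splits a sentence into word and punctuation tokens with a regular expression. The `tokenize` function and its pattern are assumptions chosen for illustration; production NLP systems typically use more elaborate schemes, such as subword tokenizers.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens and standalone punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores (a word);
    # [^\w\s] matches a single character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = tokenize("AI is reshaping the way we live, work, and interact.")
print(tokens)
```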
True or False Questions
1. True: Machine learning uses data to enhance computer performance by detecting patterns and making decisions autonomously.
2. False: Supervised learning uses labeled data to train algorithms, while unsupervised learning uses unlabeled data.
3. False: Deep learning models usually require large datasets to learn hierarchical representations effectively.
4. False: CNNs are designed for processing grid-like data, such as images, not textual data.
5. True: GANs include a generator that creates synthetic data and a discriminator that distinguishes between real and generated data.
6. False: Tokenization is an NLP technique that breaks down text into smaller units, not a computer vision technique.
7. True: Image classification is a computer vision task that categorizes images into predefined classes.
8. True: Autoencoders are a type of deep learning model used for unsupervised learning and representation learning.
9. True: Reinforcement learning focuses on learning through interactions with an environment, receiving rewards or penalties.
10. False: Robotics often integrates AI technologies such as machine learning and computer vision to enhance robots’ capabilities.
- Don, E. (2025, January 18). Why you shouldn’t learn to code in 2025. LinkedIn. https://www.linkedin.com/pulse/why-you-shouldnt-learn-code-2025-ebenezer-don-hb08f/ ↵
- Modi, P. (2025, February 14). The future of programming will not be same. - Sam Altman [Blog post]. EducationNext. https://www.educationnext.in/posts/programming-at-the-beginning-of-2025-and-at-the-end-of-2025-will-be-very-different ↵
- Samuel, A. L. (1959). Some studies in ML using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229. https://doi.org/10.1147/rd.33.0210 ↵
- Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. https://doi.org/10.1037/h0042519 ↵
- Clark, T., Towers, M., Evers, C., & Hare, J. (2024). Beyond the rainbow: High performance deep reinforcement learning on a desktop PC. ArXiv. https://arxiv.org/abs/2411.03820 ↵
- Goodfellow, I. J., Mirza, M., Xu, B., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial networks. ArXiv. https://arxiv.org/abs/1406.2661 ↵
- Nadkarni, P. M., Ohno-Machado, L., & Chapman, W. W. (2011). Natural language processing: An introduction. Journal of the American Medical Informatics Association, 18(5), 544–551. https://doi.org/10.1136/amiajnl-2011-000464 ↵
- Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google's NMT system: Bridging the gap between human and machine translation. ArXiv. https://doi.org/10.48550/arXiv.1609.08144 ↵
- Pichai, S., & Hassabis, D. (2024, February 15). Our next-generation model: Gemini 1.5. Google Blog. https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/ ↵
- Kilian, W. J., Prinsloo, J., Vosloo, J., & Taljaard, S. (2024). A synthetic segmentation dataset generator using a 3D modeling framework and raycaster: A mining industry application. Frontiers in Artificial Intelligence, 7. https://doi.org/10.3389/frai.2024.1453931 ↵
- Simonite, T. (2014). Facebook creates software that matches faces almost as well as you do: Facebook’s new AI research group reports a major improvement in face-processing software. MIT Technology Review, 117(3), 19. ↵
- Pesenti, J. (2021, November 2). An update on our use of face recognition. https://about.fb.com/news/2021/11/update-on-use-of-face-recognition/ ↵
- The Office of Inspector General at the U.S. Department of Homeland Security. (2024). TSA's modernization of identity authentication and biometric technology to enhance passenger screening at U.S. airports. https://www.oig.dhs.gov/node/7113. ↵
- Hill, K., & Kilgannon, C. (2022, December 22). Madison Square Garden uses facial recognition to ban its owner’s enemies. The New York Times. https://www.nytimes.com/2022/12/22/nyregion/madison-square-garden-facial-recognition.html ↵
- Yücer, S., Tektas, F., Moubayed, N., & Breckon, T. (2023). Racial bias within face recognition: A survey. ArXiv, abs/2305.00817. https://doi.org/10.48550/arXiv.2305.00817. ↵
- Lunter, J. (2020). Beating the bias in facial recognition technology. Biometric Technology Today, 2020(9), 5–7. https://doi.org/10.1016/s0969-4765(20)30122-3 ↵
- Young, C. (2022, April 18). Elon Musk: There will likely be a humanoid robot in every home by 2050. https://interestingengineering.com/innovation/robots-musk-in-homes ↵
- Toews, R. (2024, March 10). 5 AI predictions for the year 2030. Forbes. https://www.forbes.com/sites/robtoews/2024/03/10/10-ai-predictions-for-the-year-2030 ↵
- Yim, J. K., Wang, E. K., & Fearing, R. S. (2019). Drift-free roll and pitch estimation for high-acceleration hopping. 2019 International Conference on Robotics and Automation, 8986–8992. https://doi.org/10.1109/ICRA.2019.8793259 ↵
- The AlphaFold team. (2020, November 30). AlphaFold: A solution to a 50-year-old grand challenge in biology. DeepMind. https://deepmind.google/discover/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology/ ↵
- Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., Ronneberger, O., Willmore, L., Ballard, A. J., Bambrick, J., Bodenstein, S. W., Evans, D. A., Hung, C. C., O’Neill, M., Reiman, D., Tunyasuvunakool, K., Wu, Z., Žemgulytė, A., Arvaniti, E., Beattie, C., Bertolli, O., Bridgland, A., Cherepanov, A., Congreve, M., ... Jumper, J. M. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630, 493–500. https://doi.org/10.1038/s41586-024-07487-w ↵

