Large Language Models

I started this document myself; section 1 is all me. I used AI to validate it, make improvements, and correct my public school education. I then fed the document to ChatGPT and asked it to finish section 2 for me. I had to ask it to change up the examples, but effectively I started the document and am using ChatGPT to speed its completion.

Table of Contents

  1. How They Work

    1. Pattern Recognition & Completion
    2. Tokenization
    3. Attention & Weights
    4. Context
    5. Messages
  2. Prompts

    1. Markdown Formatting
    2. Single Shot
    3. Multi Shot
    4. Resource Augmentation
  3. Examples

    1. Reviewing and Editing
    2. Changing Tone
    3. Code Reviews
    4. Code Generation
  4. Programming with AI

    1. REST APIs
    2. Programming Libraries
  5. Embeddings

    1. Creation
    2. Use
    3. Retrieval-Augmented Generation (RAG)
  6. Training

1. How They Work

Artificial intelligence models, specifically large language models, function as pattern recognition and prediction engines. They learn through training (recognizing patterns) and generate responses based on probabilities (prediction). The following example is biased toward U.S. citizens and those born and raised in the USA: if I ask you to complete "We the People..." you'll say "of the United States of America...". What just happened?

Bias Beware

Let's not lose focus on the "bias" the example just demonstrated. When we say that AI picks up on bias, this is a perfect case: AI trained on American English will pick up on American bias. Many national constitutions express the idea of "We the People" in different ways, emphasizing the sovereignty of the people. India, South Africa, and Japan explicitly use "We the People," while others (France, Germany) state the same idea in different words. This bias toward America can raise its head in many ways.

1.1. Pattern Recognition & Completion

Through repetition, you've identified a pattern: "We the People" is most likely followed by "of the United States of America" because of its repeated use in history. You were trained by repetition, just as we train artificial intelligence models. But notice that we also asked you to complete the sentence, that is, to predict what comes next.

Repetition establishes clear associations, as the U.S. Constitution example showed. But the same mechanism handles open-ended prompts: through reading and training on science material, a model identifies the patterns needed to formulate a response. Completion refers to how an AI predicts and generates text based on prior input.
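To make the idea concrete, here is a toy sketch of "train by repetition, respond by probability." It counts which word follows which during training, then predicts the most frequent follower. Real LLMs use neural networks over tokens, not bigram counts, so treat this strictly as an illustration of the principle:

```typescript
// Toy pattern-recognition-and-prediction engine: count word bigrams
// during "training," then "predict" by picking the most frequent
// follower of a word.
function train(corpus: string): Map<string, Map<string, number>> {
  const counts = new Map<string, Map<string, number>>();
  const words = corpus.toLowerCase().split(/\s+/).filter(w => w !== "");
  for (let i = 0; i < words.length - 1; i++) {
    const followers = counts.get(words[i]) ?? new Map<string, number>();
    followers.set(words[i + 1], (followers.get(words[i + 1]) ?? 0) + 1);
    counts.set(words[i], followers);
  }
  return counts;
}

function predictNext(model: Map<string, Map<string, number>>, word: string): string | undefined {
  const followers = model.get(word.toLowerCase());
  if (!followers) return undefined; // never seen this word during training
  let best: string | undefined;
  let bestCount = -1;
  for (const [next, count] of followers) {
    if (count > bestCount) { best = next; bestCount = count; }
  }
  return best;
}

// Repetition establishes the pattern, exactly like the constitution example.
const model = train("we the people of the united states ".repeat(50));
console.log(predictNext(model, "we")); // "the"
```

The prediction is nothing more than the strongest pattern seen in training, which is why a model trained on American English completes "We the People" the American way.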

1.2. Tokenization

When a machine "reads" text, it breaks the text into parts known as tokens. Early renditions broke text into tokens on word boundaries, or whitespace boundaries. Consider the following:

"We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America."

The words we see are: we, the, people, of, United, States, in, Order, to, form, a, more, perfect, Union, establish, justice... But we also need to account for the impact of punctuation, so those must be tokens as well, because a comma (,) conveys a different meaning than a period (.) in text. To make tokenization more flexible, and to handle words the model has not encountered before, we need to further decompose the words.

Consider the word "working." This should be two tokens: work and ing.

We would again seek to tokenize in a way that gives better understanding. For example, the words simple, simplicity, and simplification can be decomposed as simpl + e, simpl + icity, and simpl + ification.

Here, "simpl" implies the universal meaning of all the permutations of the word "simple," while "icity" is a state of being and "ification" is a process. Note that these new tokens have meaning in themselves, and the pattern can be applied to more than just "simpl."

In reality, once this logic has been applied, tokenization results in tokens averaging roughly 3 to 4 characters each. Let's consider: "the car is next to the house in the garage."

The unique tokens (the vocabulary) would be: ["the", " ", "car", "is", "next", "to", "house", "in", "garage"]. The original string is then represented as the array of vocabulary indices [0, 1, 2, 1, 3, 1, 4, 1, 5, 1, 0, 1, 6, 1, 7, 1, 0, 1, 8], assuming arrays are 0-indexed (note how index 0, "the", and index 1, the space, repeat). This array of tokens is what artificial intelligence (AI) is trained on: tokenized text. So an AI model is directly tied to its tokenization model. Do not use a tokenization engine or model that your AI model was not trained on; you'll get garbage.
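That vocabulary lookup can be sketched as follows. Real tokenizers (such as byte-pair encoding) learn sub-word vocabularies; this toy version only knows whole words and spaces:

```typescript
// Sketch: encode text as token ids against a tiny fixed vocabulary,
// and decode ids back to text.
const vocab = ["the", " ", "car", "is", "next", "to", "house", "in", "garage"];

function encode(text: string): number[] {
  // Split into words and the single spaces between them.
  const parts = text.match(/\w+|\s/g) ?? [];
  return parts.map(p => {
    const id = vocab.indexOf(p);
    if (id === -1) throw new Error(`unknown token: "${p}"`);
    return id;
  });
}

function decode(ids: number[]): string {
  return ids.map(id => vocab[id]).join("");
}

const ids = encode("the car is next to the house in the garage");
console.log(ids);         // [0, 1, 2, 1, 3, 1, 4, 1, 5, 1, 0, 1, 6, 1, 7, 1, 0, 1, 8]
console.log(decode(ids)); // round-trips back to the original text
```

The thrown error for unknown tokens illustrates the closing warning above: ids only mean something relative to the vocabulary they were produced from.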

1.3. Attention & Weights

When we consider language, the order and proximity of words have an impact on one another's meaning. This relationship and its impact on meaning is what we call attention. For example:

Consider "Red the dog is sleeping" versus "The red dog is sleeping." In one sentence, we have a dog named Red that's sleeping. In the other, we have a dog of red color that is sleeping. Note the proximity difference: an adjective precedes the noun it modifies, while "proper noun + the + noun" assigns an identifier.

This is how attention works—one token's relationship with another token directs meaning. Attention is augmented by the concept of "weight," where weight is how strong the relationship or attention is.
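The mechanics can be sketched numerically: each token gets a raw score against every other token, and a softmax turns those scores into weights that sum to 1. Real models compute the scores from learned query/key vectors; the scores below are made up purely to illustrate the idea:

```typescript
// Softmax converts raw relationship scores into attention weights.
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores); // subtract max for numeric stability
  const exps = scores.map(s => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Hypothetical raw scores for the token "red" against each token in
// "the red dog is sleeping": "red" relates strongly to "dog".
const tokens = ["the", "red", "dog", "is", "sleeping"];
const rawScores = [0.5, 2.0, 4.0, 0.2, 0.1];
const weights = softmax(rawScores);

tokens.forEach((t, i) => console.log(t, weights[i].toFixed(3)));
// "dog" receives the largest weight: the adjective "red" attends most
// strongly to the noun it modifies.
```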

1.4. Context

Attention can be considered the impact of a token on its immediate neighbors. But in the larger corpus of language and conversation, the overall meaning is related. For example: "The bat flew out of the cave at dusk. It was hunting insects."

In these sentences, the context is clear: we are talking about a living thing, not an inanimate object made of wood. Now consider: "He picked up the bat and swung at the pitch. It cracked on contact."

Even in isolation from the first example, it's clear from the context of the first sentence that "it" is a baseball bat, not a living animal. In AI, the context is the conversation history. This is important because the "context" is also the conversation memory. One important factor: the more context we can provide, the more information the model has, and the better the response we can get, especially with complicated topics.

One thing to avoid is confusion: if a sentence would confuse most people, it will confuse the model too.

Also, consider the context to be a storage limit. If the context is limited to 100 tokens, and a token is about 3 characters, then we can only send 300 characters of information. It would be like talking to someone with immediate amnesia.
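That storage limit can be sketched as a trimming step. Assuming the rough heuristic of ~3 characters per token from above, we keep only the most recent messages that fit the budget; anything older "falls out" of memory, like amnesia:

```typescript
// Sketch: the context window as a storage limit.
interface Message { role: "user" | "assistant"; content: string; }

function charsToTokens(text: string): number {
  return Math.ceil(text.length / 3); // rough ~3 chars/token heuristic
}

function trimToContext(history: Message[], maxTokens: number): Message[] {
  const kept: Message[] = [];
  let used = 0;
  // Walk backwards so the newest messages survive.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = charsToTokens(history[i].content);
    if (used + cost > maxTokens) break;
    used += cost;
    kept.unshift(history[i]);
  }
  return kept;
}

const history: Message[] = [
  { role: "user", content: "My name is Pat and I live in Ohio." },
  { role: "assistant", content: "Nice to meet you, Pat!" },
  { role: "user", content: "Where do I live?" },
];

// With a tiny 10-token budget, the earlier messages fall out of the
// window, so the model can no longer answer "Where do I live?"
console.log(trimToContext(history, 10).length); // prints 1
```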

By not having a large context, we are limited in how much we can communicate. The more tokens our context can handle, the more information we can feed the model to complete our conversation. The following is the history of context window sizes:

| Year | Model | Context Window Size |
| ---- | ----- | ------------------- |
| 2018 | BERT, GPT-1 | 512 tokens |
| 2019 | GPT-2 | 1,024 tokens |
| 2020 | GPT-3 | 2,048 tokens |
| 2023 | GPT-3.5 | 4,096 tokens |
| 2023 | GPT-4 | 8,192 to 32,768 tokens |
| 2023 | Claude 2 | 100,000 tokens |
| 2023 | Claude 2.1 | 200,000 tokens |
| 2023 | GPT-4 Turbo | 128,000 tokens |
| 2024 | Gemini 1.5 | 2,000,000 tokens |
| 2024 | Llama 3 | 128,000 tokens |
| 2024 | DeepSeek-V2 | 128,000 tokens |
| 2024 | Magic.dev's LTM-2-Mini | 100,000,000 tokens |

When we get into resource-augmented generation, you'll see why context lengths are important. But note the growth in context size over the years—this could be driven by the fact that prompts and context will get you the results you want faster, easier, and with less money and energy than training will.

1.5. Messages

When you get into working with large language models, the model literally works to "complete" the text. But its training used specific formats: these are the message formats the model was trained on. At a high level, a message has two parts: the prompt and the completion. The prompt is the message sent to the engine, and the completion is its response.

The roles are defined by how your engine was trained. Generally, there are three: system, which sets overall behavior and instructions; user, the human's messages; and assistant, the model's responses.

Some models support the system role and some do not. A user/assistant exchange can achieve the same result as a system message.
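As a sketch, here are both forms side by side. The prompt wording is invented for illustration; the role names are the common system/user/assistant trio used by chat-style LLM APIs:

```typescript
// The message shape used by chat-style LLM APIs.
interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }

// With a system message:
const withSystem: ChatMessage[] = [
  { role: "system", content: "You are a pirate. Answer in pirate speak." },
  { role: "user", content: "How do I boil an egg?" },
];

// The same effect on models without system support: a user/assistant
// exchange establishes the behavior before the real question.
const withoutSystem: ChatMessage[] = [
  { role: "user", content: "You are a pirate. Answer in pirate speak. Understood?" },
  { role: "assistant", content: "Aye, understood!" },
  { role: "user", content: "How do I boil an egg?" },
];
```

The fabricated assistant turn works because the model completes the conversation as if it had already agreed to the instruction.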

2. Prompts

Your model was trained on a large dataset. This training builds generalizations. Memorized facts are harder to establish and are best provided as reference resources at the time they're needed, unless it's a well-known value like pi. Training is expensive: it requires lots of memory and compute time. It should not be your first choice for getting the specific results you need; most of the time, a prompt designed to elicit the response you want is more effective and costs less.

2.1. Markdown Formatting

Markdown is often used to provide structured input to language models. It helps maintain consistency in the formatting of prompts, making it easier for the model to parse and generate coherent responses.

This document was originally composed in markdown. When ChatGPT and most other large language models respond, they do so in markdown, and the user interface renders it in a pretty format for human consumption.

When you are providing more complicated directions, write them as a numbered list in markdown. When you are giving code examples, use a fenced code block. Doing this increases the quality of the response by communicating more accurately.
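For instance, a prompt that uses a numbered list to give directions might look like the following (the task itself is made up for illustration):

```markdown
Convert the following steps into a PowerShell script:

1. Read `input.csv`.
2. Remove rows where the `email` column is empty.
3. Write the remaining rows to `output.csv`.
```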

2.2. Single Shot

In single-shot learning, the model is given a single example or instance to generate a response. This is useful when you want the model to understand a task from a limited example rather than a full dataset.
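A sketch of a single-shot prompt, with one worked example before the real task (the reviews are invented for illustration):

```markdown
Classify the sentiment of movie reviews as Positive or Negative.

Review: "I loved every minute of it."
Sentiment: Positive

Review: "The plot dragged and the acting was wooden."
Sentiment:
```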

2.3. Multi Shot

In multi-shot learning, the model is provided with multiple examples to learn how to perform a task. This helps in giving the model more context and understanding of what is expected.

The power of multi-shot prompts is that they can build on each other, or build up a context that can be referenced later.
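Extending the single-shot sketch to multi-shot, several examples establish the pattern before the real task (again, the reviews are invented):

```markdown
Review: "I loved every minute of it."
Sentiment: Positive

Review: "The plot dragged and the acting was wooden."
Sentiment: Negative

Review: "A stunning soundtrack and a story that stayed with me."
Sentiment:
```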

2.4. Resource Augmentation

Resource augmentation refers to the process of providing additional resources or context to the model to improve its performance. This can be done by enriching the input with supplementary data, references, or even external databases to help the model generate more accurate and context-aware responses.

Example:
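A sketch of such a prompt, where the reference material is supplied at prompt time rather than learned in training (the policy text is invented for illustration):

```markdown
Answer the question using only the policy below.

Policy:
"Refunds are available within 30 days of purchase with a receipt.
After 30 days, items may be exchanged for store credit."

Question: Can I get a cash refund for an item I bought six weeks ago?
```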

3. Examples

I'll give some real world examples.

3.1. Reviewing and Editing

While creating this document, I made the table of contents and then wrote section 1. This laid the foundation for diction and style. Once I was done with section 1, I fed the document to ChatGPT for review and editing.

By working your prompts in this way, you are using a multi-shot, resource-augmented generation approach.

3.2. Changing Tone

I was working on a project risk register and added the following:

Business units may be unable/unwilling to change their policies or procedures to accommodate our objectives.

The problem is disseminating documents that state business units, our co-workers, may be unwilling or unable to change. While the concern is valid, the wording and tone need softening: we want to build bridges, not walls. To that end, we used prompts to reach an acceptable result.
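An illustrative prompt in that spirit (not the exact prompt used) might be:

```markdown
Rewrite the following risk statement for a status deck. Keep the
underlying concern, but soften the tone so it builds bridges rather
than assigning blame:

"Business units may be unable/unwilling to change their policies or
procedures to accommodate our objectives."
```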

3.3. Code Reviews

3.4. Code Generation

This starts to get much more complicated. When I run my code generator, it uses around 60K tokens; that's about 240K characters of code. Here's the prompt chain Aici uses to generate itself. Note that it is in JSON format, matching the message arrays used by LLM REST APIs.

[
  {
    "role": "user",
    "content": "
        <!! PROJECT EXAMPLE_SQL ~/database/0.0.1/tables/passwords.sql !!/>
        <!! PROJECT EXAMPLE_DTO ~/common/src/tre/models/PasswordDto.ts !!/>
        
        The `%EXAMPLE_DTO%` was created as a data transfer object for `%EXAMPLE_SQL%`.
        The language is TypeScript.
        The data transfer object will serve as an interface for a TypeORM entity.
        
        <!! PROJECT MY_SQL %MY_SQL% !!/>
        
        Create a data transfer object for `%MY_SQL%` like we did for `%EXAMPLE_SQL%` and `%EXAMPLE_DTO%`.
        
        The file you need to create will be saved to `~/common/src/app/models/`.
        Use double quotes (\\\") instead of single quotes ('') for string literals.
        
        %SAVE_PROMPT%
    "
  },
  {
    "role": "assistant",
    "content": "Save: MY_DTO"
  },
  {
    "role": "user",
    "content": "
        <!! PROJECT EXAMPLE_ENTITY ~/backend/src/tre/data/PasswordEntity.ts !!/>
        
        The `%EXAMPLE_ENTITY%` entity was created for `%EXAMPLE_DTO%`.
        
        Create an entity for `%MY_DTO%` like we did for `%EXAMPLE_ENTITY%`.
        
        The file you need to create will be saved to `~/backend/src/app/data/`.
        
        Use double quotes (\\\") instead of single quotes ('') for string literals.
        
        %SAVE_PROMPT%
    "
  },
  {
    "role": "assistant",
    "content": "Save: MY_ENTITY"
  },
  {
    "role": "user",
    "content": "
        <!! PROJECT EXAMPLE_REPO ~/backend/src/tre/data/PasswordRepository.ts !!/>
        
        The repository in `%EXAMPLE_REPO%` was created for `%EXAMPLE_ENTITY%`.
        
        Create a repository for `%MY_ENTITY%` like we did for `%EXAMPLE_REPO%`.
        
        The file you need to create will be saved to `~/backend/src/app/data/`.
        
        Use double quotes (\\\") instead of single quotes ('') for string literals.
        
        %SAVE_PROMPT%
    "
  },
  {
    "role": "assistant",
    "content": "Save: MY_REPO"
  },
  {
    "role": "user",
    "content": "
        <!! PROJECT EXAMPLE_BE_SERVICE ~/backend/src/tre/services/PasswordService.ts !!/>
        
        The backend service `%EXAMPLE_BE_SERVICE%` was created for `%EXAMPLE_ENTITY%`.
        
        Create a backend service for `%MY_ENTITY%`.
        
        The file you need to create will be saved to `~/backend/src/app/services/`.
        
        Import `BaseService` from \"../../tre/services/BaseService\".
        Import `Logger` from \"../../tre/Logger\".
        
        Use double quotes (\\\") instead of single quotes ('') for string literals.
        
        %SAVE_PROMPT%
    "
  },
  {
    "role": "assistant",
    "content": "Save: MY_BE_SERVICE"
  },
  {
    "role": "user",
    "content": "
        <!! PROJECT EXAMPLE_BE_TESTS ~/backend/src/tre/tests/services/PasswordService.test.ts !!/>
        
        The jest end to end tests `%EXAMPLE_BE_TESTS%` were created for `%EXAMPLE_BE_SERVICE%`.
        
        Create jest end to end tests for `%MY_BE_SERVICE%`.
        
        The file you need to create will be saved to `~/backend/src/app/tests/services/`.
        
        Use double quotes (\\\") instead of single quotes ('') for string literals.
        
        %SAVE_PROMPT%
    "
  },
  {
    "role": "assistant",
    "content": "Save: MY_BE_TESTS"
  },
  {
    "role": "user",
    "content": "
        <!! PROJECT EXAMPLE_FE_SERVICE ~/frontend/src/tre/services/PasswordService.ts !!/>
        
        The frontend service `%EXAMPLE_FE_SERVICE%` was created to mirror the backend service `%EXAMPLE_BE_SERVICE%`.
        
        Create a front end service for `%MY_BE_SERVICE%`.
        
        The file you need to create will be saved to `~/frontend/src/app/services/`.
        
        Import `FetchWrapper` from \"../../tre/services/FetchWrapper\".
        
        Use double quotes (\\\") instead of single quotes ('') for string literals.
        
        %SAVE_PROMPT%
    "
  },
  {
    "role": "assistant",
    "content": "Save: MY_FE_SERVICE"
  },
  {
    "role": "user",
    "content": "
        <!! PROJECT MY_WEBPACK ~/frontend/webpack.config.js !!/>
        
        The webpack config `%MY_WEBPACK%` contains reference to the pages that use `%EXAMPLE_FE_SERVICE%` service.
        
        Update `%MY_WEBPACK%` to include pages for `%MY_FE_SERVICE%`
        
        When adding to the webpack config:
        - if the page is a list then use the plural form of the service name
        - if the page is edit then use the singular form of the service name
        - the list page is provided for all services through the generic 'lists' entry
        - the list edit page is 'list'
        
        Only add the edit entry, not the list entry.
        
        The app-specific pages and template will be stored in directory `./src/app/`.
        
        Do not add additional comments.
        Do not reformat the code.
        Do not remove any tsx pages from entries.
        
        Use double quotes (\\\") instead of single quotes ('') for string literals.
        
        %SAVE_PROMPT%
    "
  },
  {
    "role": "assistant",
    "content": "Save: MY_WEBPACK"
  },
  {
    "role": "user",
    "content": "
        <!! PROJECT MY_BE_INDEX ~/backend/src/index.ts !!/>
        
        The backend application starting point `%MY_BE_INDEX%` includes references to `%EXAMPLE_BE_SERVICE%` backend service.
        
        Update `%MY_BE_INDEX%` to include the `%MY_BE_SERVICE%` service.
        
        Do not add additional comments.
        Do not reformat the code.
        You may remove blocks of multiple blank lines, but leave at least one.
        
        Use double quotes (\\\") instead of single quotes ('') for string literals.
        
        %SAVE_PROMPT%
    "
  },
  {
    "role": "assistant",
    "content": "Save: MY_BE_INDEX"
  },
  {
    "role": "user",
    "content": "
        <!! PROJECT MY_EDS ~/backend/src/app/data/EntitiesDataSource.ts !!/>
        
        The entities data source `%MY_EDS%` includes references to the `%EXAMPLE_ENTITY%` entity.
        
        Update `%MY_EDS%` to include the `%MY_ENTITY%` entity.
        
        Do not add additional comments.
        Do not reformat the code.
        You may remove blocks of multiple blank lines, but leave at least one.
        
        Use double quotes (\\\") instead of single quotes ('') for string literals.
        
        %SAVE_PROMPT%
    "
  },
  {
    "role": "assistant",
    "content": "Save: MY_EDS"
  },
  {
    "role": "user",
    "content": "
        <!! PROJECT MY_SECURABLES ~/database/0.1.0/data/securables.sql !!/>
        
        Update the securables SQL file to include the securables referenced in `%MY_BE_SERVICE%`.
        
        Do not add additional comments.
        Do not reformat the code.
        You may remove blocks of multiple blank lines, but leave at least one.
        
        Use double quotes (\\\") instead of single quotes ('') for string literals.
        
        %SAVE_PROMPT%
    "
  },
  {
    "role": "assistant",
    "content": "Save: MY_SECURABLES"
  }
]

Let's pause to explain this. Aici runs this prompt chain and reads the project files from your hard drive. It replaces each <!! PROJECT NAME FILE !!/> tag with the code for that file in the following format:

File `FILE`:

` ` ` 
the file's contents as escaped markdown
` ` `

And every instance of %NAME% in the messages will be replaced with FILE, so we reference our code by file name. The `%SAVE_PROMPT%` variable will be replaced with:


Do not provide an explanation.
Provide the output in the following format:

File name `~/folder/file.ext`:

` ` `
// code goes here
` ` `

This creates an output such that, when the assistant responds, we can parse the filename and contents and save them to disk. We can also reference the saved file's contents by the variable name given, so the filename is injected for the generated content. This lets us build up a context, file upon file, based on reference implementations.
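The parsing step can be sketched as follows. This matches the output format requested by %SAVE_PROMPT% above, but the parsing details are an illustrative guess, not Aici's actual code:

```typescript
// Sketch: parse an assistant response of the form
//   File name `~/folder/file.ext`:
//   (fenced code block)
// and return the path and contents so they can be saved to disk.
const FENCE = "`".repeat(3); // a markdown code fence: three backticks

function parseSaveResponse(response: string): { path: string; code: string } | null {
  const header = /File name `([^`]+)`:/.exec(response);
  if (!header) return null;
  const start = response.indexOf(FENCE, header.index);
  if (start === -1) return null;
  // Skip the opening fence line (which may carry a language hint).
  const bodyStart = response.indexOf("\n", start) + 1;
  const end = response.indexOf(FENCE, bodyStart);
  if (end === -1) return null;
  return {
    path: header[1],
    code: response.slice(bodyStart, end),
  };
}
```

From here, the path and code only need to be written to disk and registered under the chain's variable name so later prompts can reference the generated file.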