Saturday, August 24, 2024

Generative AI with public cloud

LLM:

  • A language model (LM) is a probabilistic model of text.

Encoders:

Models that convert a sequence of words to an embedding.

Decoders:

Models that take a sequence of words and output the next word.

Examples: GPT-4, Llama, and BLOOM.

Encoder-decoder models:

The input text is tokenized and encoded by the encoder; the decoder then generates the output one token at a time.

Hallucination:

It is generated text that is non-factual and/or ungrounded.

LLM application:

Retrieval Augmented Generation (RAG)

  • Primarily used in question answering (QA), where the model has access to supporting documents for a query.

Code models:

  • Instead of training on written language, these models are trained on code and comments.

In-context learning and few shot prompting:

  • In-context learning: conditioning an LLM with instructions and/or demonstrations of the task.
  • k-shot prompting: explicitly providing k examples of the intended task in the prompt.
  • F-strings (formatted string literals) are a Python feature that can be used to create prompt templates for an LLM, as in the sketch below.
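
A minimal sketch of a k-shot prompt template built with a Python f-string; the sentiment-classification task, the example texts, and the render_prompt helper are all illustrative, not part of any particular API.

examples = [
    ("The service was wonderful.", "positive"),
    ("The package arrived broken.", "negative"),
]

def render_prompt(query: str) -> str:
    # Join the k demonstrations, then append the new input to classify.
    shots = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in examples)
    return f"Classify the sentiment of the text.\n\n{shots}\nText: {query}\nLabel:"

print(render_prompt("I would order again."))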

Language Agents:

* A budding area of research where LLM-based agents plan, reason, and take actions to accomplish tasks.

Some notable work in the space:

* ReAct

An iterative framework where the LLM emits a thought, then takes an action, and observes the result (a minimal loop sketch follows this list).

* Toolformer

A pre-training technique in which certain strings are replaced with calls to tools that yield results.
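
A minimal sketch of a ReAct-style loop, assuming a hypothetical llm() completion function and a single toy calculator tool; real implementations add robust action parsing, multiple tools, and stop conditions.

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical: call your model of choice here

def calc(expression: str) -> str:
    return str(eval(expression))  # toy tool: evaluates an arithmetic expression

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")  # the LLM emits a thought and an action
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action: calc[" in step:  # crude parsing of "Action: calc[...]"
            expr = step.split("Action: calc[", 1)[1].split("]", 1)[0]
            transcript += f"Observation: {calc(expr)}\n"  # observe the tool result
    return transcript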

OCI Generative AI service:

* A fully managed service that provides a set of customizable large language models (LLMs), available via a single API, to build generative AI applications.

Generation:

Command -> Command Light -> Llama 2 70B

Dedicated AI cluster:

* A dedicated AI cluster is a GPU-based resource that hosts the customer's fine-tuning and inference workloads.

OCI setup:

configuration file: ~/.oci/config
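
A minimal sketch of loading that configuration with the OCI Python SDK; from_file and validate_config are real SDK helpers, and "DEFAULT" is the standard profile name.

import oci

# Reads credentials from the default location (~/.oci/config), DEFAULT profile.
config = oci.config.from_file("~/.oci/config", "DEFAULT")

# Raises an exception if required keys (user, tenancy, region, ...) are missing.
oci.config.validate_config(config)
print(config["region"])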

Model parameters:

Temperature: Determines how creative the model should be; the default temperature is 1 and the maximum is 5.

Length: Approximate length of the summary; choose from short, medium, and long.
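
A small sketch of what temperature does mathematically: the logits are divided by the temperature before the softmax, so higher values flatten the distribution and make sampling more random (pure NumPy, toy logits).

import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = np.asarray(logits) / temperature  # higher T => flatter distribution
    exps = np.exp(scaled - scaled.max())       # subtract max for numerical stability
    return exps / exps.sum()

logits = [2.0, 1.0, 0.1]                          # toy next-token scores
print(softmax_with_temperature(logits, 0.5))      # peaked: nearly deterministic
print(softmax_with_temperature(logits, 5.0))      # flat: more "creative" sampling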

Embeddings:

An embedding is a numerical representation of a piece of text, converted into a sequence of numbers.

A piece of text could be a word, a phrase, a sentence, a paragraph, or several paragraphs.

The models create a 1024-dimensional vector for each embedding.

Maximum of 512 tokens per embedding.

The light models create a 384-dimensional vector for each embedding.
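
A small sketch of comparing two embeddings with cosine similarity, the usual measure for semantic closeness; the toy 4-dimensional vectors stand in for real 1024- or 384-dimensional ones.

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = [0.9, 0.1, 0.4, 0.0]      # toy embedding vectors
kitten = [0.8, 0.2, 0.5, 0.1]
car = [0.0, 0.9, 0.0, 0.7]
print(cosine_similarity(cat, kitten))  # close to 1: similar meaning
print(cosine_similarity(cat, car))     # much lower: unrelated meaning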

It is very difficult to fine-tune a model with billions of parameters, so in-context learning / few-shot prompting is used instead.

GPU memory is limited, so switching between models can incur significant overhead due to reloading the full GPU memory.

Dedicated AI cluster units:

* Large Cohere - dedicated AI cluster unit for hosting or fine-tuning the Cohere Command model

* Small Cohere - dedicated AI cluster unit for hosting or fine-tuning the smaller Cohere Command model

* Embed Cohere - dedicated AI cluster unit for hosting the Cohere embedding models

* Llama2-70 - dedicated AI cluster unit for hosting the Llama 2 models

Fine-tuning requires 2 units, and each cluster is active for five hours.

RAG framework:

Retriever: Acts like a search engine, fetching information relevant to the query.

Ranker: Evaluates and prioritizes the retrieved information based on its quality.

Generator: Produces human-like text grounded in the retrieved content.

RAG techniques:

RAG-Sequence: the same retrieved documents are used to generate the entire response.

RAG-Token: different documents can be retrieved for each generated token.

RAG pipelines:

Documents -> Chunks -> Embedding -> Index [database]
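
A minimal sketch of this ingestion pipeline, assuming a hypothetical embed() function that wraps whichever embedding model is used; the in-memory list stands in for a real vector database.

import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # hypothetical: call your embedding model here

def chunk(document: str, size: int = 500) -> list[str]:
    # Naive fixed-size character chunking; real pipelines split on structure.
    return [document[i:i + size] for i in range(0, len(document), size)]

index = []  # stands in for the vector database: (chunk_text, embedding) pairs
for doc in ["...document one...", "...document two..."]:
    for piece in chunk(doc):
        index.append((piece, embed(piece)))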

Vector database:

A vector is a sequence of numbers, called dimensions, used to capture the important "features" of the data.

Semantic search:

It means searching by meaning rather than by exact keyword matching, as in the sketch below.
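
A short sketch of the query side, reusing the hypothetical embed() and the index from the pipeline sketch above: the query is embedded the same way as the chunks, and the closest chunks by cosine similarity are returned.

def semantic_search(query: str, index, top_k: int = 3):
    q = embed(query)  # embed the query with the same model as the chunks
    scored = [
        (text, float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec))))
        for text, vec in index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]  # matches by meaning, not by keyword overlap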

