Critical Factors When Orchestrating an Optimized Large Language Model (LLM)

IBL News | New York

When choosing and orchestrating an LLM, there are many critical technical factors to weigh, such as training data, dataset filtering, the fine-tuning process, capabilities, latency, technical requirements, and price.

Experts point out that calling a hosted LLM API, such as GPT-4's, is not the only option.

As a paradigm-shifting technology with a rapid pace of innovation, the LLM and Natural Language Processing market is projected to reach $91 billion by 2030, growing at a CAGR of 27%.

Beyond the parameter count, recent findings show that smaller models trained on more data can be just as effective, while delivering significant gains in latency and a major reduction in hardware requirements. In other words, the largest parameter count is not what matters most.

Training data should include conversations, games, and immersive experiences related to the subject, rather than aiming for general-purpose models that know a little about everything. For example, a model whose training data is 90% medical papers performs better on medical tasks than a much larger model where medical papers make up only 10% of its dataset.

In terms of dataset filtering, certain kinds of content have to be removed to reduce toxicity and bias. OpenAI, for example, has confirmed that erotic content is filtered out.
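
As a minimal sketch of this kind of filtering, the snippet below drops documents flagged by a placeholder toxicity scorer. The blocklist and scoring function are hypothetical stand-ins for a real trained classifier, not any vendor's actual pipeline.

```python
# Minimal sketch of pre-training dataset filtering.
# BLOCKED_TERMS and toxicity_score are hypothetical placeholders.

BLOCKED_TERMS = {"explicit-term-1", "explicit-term-2"}  # placeholder blocklist

def toxicity_score(text: str) -> float:
    """Hypothetical stand-in for a trained toxicity classifier."""
    flagged = sum(term in text.lower() for term in BLOCKED_TERMS)
    return min(1.0, flagged / max(len(BLOCKED_TERMS), 1))

def filter_corpus(documents: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents whose toxicity score stays under the threshold."""
    return [doc for doc in documents if toxicity_score(doc) <= threshold]

corpus = ["a benign medical abstract", "some explicit-term-1 content"]
print(filter_corpus(corpus))  # keeps only the benign document
```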

It's also important to build vocabularies based on how frequently words appear, filtering out colloquial conversation and common slang datasets.
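
A frequency-based vocabulary can be sketched in a few lines, as below; the whitespace tokenization and size cap are simplifying assumptions rather than a production tokenizer.

```python
from collections import Counter

def build_vocab(texts: list[str], max_size: int = 50_000) -> dict[str, int]:
    """Keep only the most frequent whitespace-separated tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_size)]
    return {token: idx for idx, token in enumerate(most_common)}

vocab = build_vocab(["the model predicts tokens", "the model ranks tokens by frequency"])
print(vocab)  # the most frequent words receive the lowest indices
```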

Models have to be fine-tuned to ensure the accuracy of the information and to avoid false information in the dataset.
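
One hedged illustration of this step is packaging human-verified question-answer pairs into the prompt/completion JSONL format commonly used for supervised fine-tuning. The example data and file name below are assumptions, not any specific vendor's pipeline.

```python
import json

# Sketch: writing reviewed Q&A pairs to a JSONL file for supervised fine-tuning.
verified_examples = [
    {"question": "What is the normal human body temperature?",
     "answer": "About 37 degrees Celsius (98.6 degrees Fahrenheit)."},
]

with open("finetune_data.jsonl", "w") as f:
    for ex in verified_examples:
        record = {"prompt": ex["question"], "completion": ex["answer"]}
        f.write(json.dumps(record) + "\n")
```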

LLMs are not commoditized, and some models have unique capabilities. GPT-4 accepts multimodal inputs such as images and handles up to about 25,000 words at a time while maintaining context. Google's PaLM generates text and code across many languages.

Other models can provide facial expressions and voice.

Inference latency is higher in models with more parameters, adding extra milliseconds between query and response, which significantly impacts real-time applications.

Google's research found that just half a second of added latency caused traffic to drop by 20%.
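
A rough way to gauge where a deployment stands is to time the round trip around each request, as in the sketch below; `query_model` is a hypothetical placeholder for a call to a hosted endpoint, with the sleep simulating network and inference time.

```python
import time

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a hosted LLM endpoint."""
    time.sleep(0.3)  # simulate network and inference time
    return "response"

start = time.perf_counter()
query_model("What drives inference latency?")
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"end-to-end latency: {elapsed_ms:.0f} ms")
```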

Many use cases that demand low or real-time latency, such as financial forecasting or video games, can't be fulfilled by a standalone LLM. They require orchestrating multiple models, specialized features, or additional automation for text-to-speech, automatic speech recognition (ASR), machine vision, memory, and more.
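
As an illustration of such orchestration, the sketch below chains placeholder ASR, LLM, and text-to-speech components with a simple conversation memory. All three function bodies are hypothetical stand-ins, not any particular vendor's API.

```python
# Minimal sketch of orchestrating several specialized components around an LLM.

def transcribe(audio: bytes) -> str:
    """Placeholder for an automatic speech recognition (ASR) model."""
    return "user question as text"

def generate_reply(prompt: str, memory: list[str]) -> str:
    """Placeholder for an LLM call that also sees recent conversation memory."""
    context = "\n".join(memory[-5:])
    return f"reply to: {prompt} (given context: {context})"

def synthesize(text: str) -> bytes:
    """Placeholder for a text-to-speech (TTS) model."""
    return text.encode()

def handle_turn(audio: bytes, memory: list[str]) -> bytes:
    """Run one voice interaction: speech in, LLM reply, speech out."""
    text = transcribe(audio)
    reply = generate_reply(text, memory)
    memory.extend([text, reply])
    return synthesize(reply)

print(handle_turn(b"raw audio", memory=[]))
```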