Nemotron-4-340B-Instruct

NVIDIA’s large language model, optimized for English conversation scenarios.

Introduction:

Nemotron-4-340B-Instruct is a large language model (LLM) developed by NVIDIA, optimized for English single-turn and multi-turn conversation scenarios. The model supports a context length of 4096 tokens and went through additional alignment steps: supervised fine-tuning (SFT), direct preference optimization (DPO), and reward-aware preference optimization (RPO). Starting from roughly 20K human-annotated examples, a synthetic data generation pipeline produced more than 98% of the data used for supervised fine-tuning and preference fine-tuning. As a result, the model performs well on human conversation preferences, mathematical reasoning, coding, and instruction following, and can generate high-quality synthetic data for a range of use cases.
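To make the alignment steps above concrete, here is a minimal sketch of the direct preference optimization (DPO) objective for a single preference pair. This is an illustration of the general technique, not NVIDIA's training code; the function name and inputs are assumptions for the example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response
    under the policy being trained or the frozen reference model.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the margin: near zero when the policy
    # clearly prefers the chosen response, large otherwise.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already favors the chosen response -> small loss.
low = dpo_loss(-10.0, -40.0, -20.0, -20.0)
# Policy favors the rejected response -> large loss.
high = dpo_loss(-40.0, -10.0, -20.0, -20.0)
```

Preference fine-tuning minimizes this loss over many such pairs, pushing the policy toward the preferred responses without a separate reward model.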
Target Users:
The Nemotron-4-340B-Instruct model is intended for developers and businesses that need to build or customize large language models. It is especially suitable for users applying AI to English conversation, mathematical reasoning, and programming assistance.
Usage Scenario Examples:

  • Generating training data to help developers train customized dialogue systems.
  • Providing accurate logical reasoning and solution generation for mathematical problem solving.
  • Helping programmers quickly understand code logic, with programming guidance and code generation.

Tool Features:

  • Supports a context length of 4096 tokens for long-text processing.
  • Optimized for dialogue and instruction following through alignment steps such as SFT, DPO, and RPO.
  • Generates high-quality synthetic data to help developers build their own LLMs.
  • Uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).
  • Supports customization via the NeMo Framework, including parameter-efficient fine-tuning and model alignment tools.
  • Performs well on a variety of evaluation benchmarks, such as MT-Bench, IFEval, and MMLU.
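Of the techniques listed above, rotary position embeddings (RoPE) are easy to sketch: each pair of embedding dimensions is rotated by an angle proportional to the token's position, so relative position information appears naturally in dot products between rotated queries and keys. The following NumPy illustration shows the idea only; it is not the model's actual implementation.

```python
import numpy as np

def apply_rope(x, position, base=10000.0):
    """Rotate dimension pairs of vector x by position-dependent angles."""
    d = x.shape[-1]          # embedding dimension, assumed even
    half = d // 2
    # One frequency per dimension pair, decaying geometrically.
    freqs = base ** (-np.arange(half) * 2.0 / d)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied to each (x1[i], x2[i]) pair.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).standard_normal(8)
rotated = apply_rope(q, position=5)
# Rotations preserve vector norms, and position 0 leaves x unchanged.
```

Because only angles depend on position, the inner product of two rotated vectors depends on the difference of their positions, which is what makes RoPE attractive for attention.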

Steps for Use:

  • 1. Create a Python script using the NeMo Framework to interact with the deployed model.
  • 2. Create a Bash script to start the inference server.
  • 3. Use the Slurm job scheduler to distribute the model across multiple nodes and connect it to the inference server.
  • 4. Define a text generation function in the Python script, setting the request headers and data structure.
  • 5. Call the text generation function, passing the prompt and generation parameters, to get the model's response.
  • 6. Adjust generation parameters such as temperature, top_k, and top_p as needed to control the style and diversity of the generated text.
  • 7. Refine the model's output by adjusting the system prompt to achieve better dialogue.
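Steps 4–6 above can be sketched as a small Python helper. The server URL, header names, and payload fields here are assumptions for illustration; the actual request format depends on how the NeMo inference server is deployed.

```python
import json

# Hypothetical address; in practice this points at the inference
# server started by the Bash/Slurm scripts described above.
SERVER_URL = "http://localhost:8080/generate"

def build_generation_request(prompt, temperature=0.7, top_k=40,
                             top_p=0.9, max_tokens=256):
    """Assemble headers and a JSON body for a text-generation request."""
    headers = {"Content-Type": "application/json"}
    payload = {
        "prompt": prompt,
        # Sampling parameters controlling style and diversity:
        "temperature": temperature,  # higher = more random output
        "top_k": top_k,              # sample from the k most likely tokens
        "top_p": top_p,              # nucleus sampling threshold
        "max_tokens": max_tokens,
    }
    return headers, json.dumps(payload)

headers, body = build_generation_request("Explain rotary position embeddings.")
# The pair would then be POSTed to SERVER_URL, e.g. with
# requests.post(SERVER_URL, headers=headers, data=body).
```

Lowering temperature and top_p makes responses more deterministic; raising them increases diversity, which is useful when generating synthetic training data.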

Tool’s Tags: Large language model, dialogue system
