# @title Install llama.cpp and HuggingFace Hub
# @markdown This cell takes approximately 2 minutes to run. The output is suppressed, so if no error is shown, you may assume that it worked.
%%capture
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.11 --force-reinstall --upgrade --no-cache-dir
!pip install huggingface_hub==0.18.0
Feel free to save a copy on your Google Drive before you begin.
Llama.cpp is a project led by Georgi Gerganov that was initially designed as a pure C/C++ implementation of the Llama large language model developed and open-sourced by Meta’s AI team.
Quoted from the llama.cpp GitHub repository:
> The main goal of llama.cpp is to run the LLaMA model using 4-bit integer quantization on a MacBook
> - Plain C/C++ implementation without dependencies
> - Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
> - AVX, AVX2 and AVX512 support for x86 architectures
> - Mixed F16 / F32 precision
> - 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization support
> - CUDA, Metal and OpenCL GPU backend support
In lay terms, this means that we can implement these models in such a way that they can be run on nearly any physical or virtual machine! You don’t need an industrial-grade, multi-GPU server to use open-source LLMs locally.
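As a preview (a minimal sketch only, not one of this module's cells), CPU-only inference with the same llama-cpp-python bindings used below looks roughly like this; the model path is a placeholder, and the real cells later download a GGUF file and enable GPU offloading.

# Minimal CPU-only sketch of llama-cpp-python usage; the cells below download a real
# GGUF file from Hugging Face and offload layers to the GPU via n_gpu_layers.
from llama_cpp import Llama

llm = Llama(model_path="path/to/model.Q4_K_M.gguf", n_ctx=2048)  # placeholder path
output = llm("Q: What does CXR stand for? A:", max_tokens=32)
print(output["choices"][0]["text"])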
When to Use an LLM Locally
- You have sensitive data that you don't want to send to OpenAI's servers, where it could potentially be stored and used to train future models
  - Virtually all healthcare data falls into this category
- You want to fine-tune an open-source LLM for a specific purpose
Overview of This Module
- Install llama.cpp and Hugging Face Hub (to download model files)
- Download the 7-billion-parameter Llama-2 model fine-tuned for chat
- Engineer a prompt to have the LLM read a chest radiography report and return structured labels for specific findings in JSON format
- Test a few example reports on Llama-2-7B-Chat
- Repeat the process with the Mistral-7B-Instruct-v0.1 model and compare the results
Note: At the time this module was developed, Mistral-7B was the best open-source 7B-parameter model available. This field is moving very quickly, so that could well change before the end of the year.
References
- Llama.cpp on GitHub: https://github.com/ggerganov/llama.cpp
- Meta AI’s Llama 2: https://ai.meta.com/llama/
- MistralAI’s Mistral-7B: https://mistral.ai/news/announcing-mistral-7b/
- HuggingFace Models: https://huggingface.co/models
Note: If you would like to experiment with other models, please search for the “GGUF” version of the model on Hugging Face.
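If you want to see which quantization files a GGUF repo offers before downloading one, a small sketch using the `huggingface_hub` API might look like the following (the repo shown is simply the Mistral repo used later in this module):

from huggingface_hub import HfApi

# List only the .gguf files in a repo so you can choose a quantization level.
repo_id = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"  # example repo used later in this module
gguf_files = [f for f in HfApi().list_repo_files(repo_id) if f.endswith(".gguf")]
print("\n".join(gguf_files))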
# @title Importing the necessary libraries
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import regex as re
import json

# @title Select the model you'd like to test
# @markdown After initially testing with one model, if you would like to test another then you must change your selection in this cell. Then you will need to re-run this cell and all of the ones below it. You can do this from the `Runtime` menu bar by selecting `Run after`.
model = "llama-2" # @param ["llama-2", "mistral"]
if model == "llama-2":
model_name = "TheBloke/Llama-2-7b-Chat-GGUF"
model_basename = "llama-2-7b-chat.Q4_K_M.gguf"
else:
model_name = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
model_basename = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"# @title Download the model from Hugging Face Hub
model_path = hf_hub_download(repo_id=model_name, filename=model_basename)# @title Initialize the llama.cpp constructor
# Feel free to play around with different hyperparameters below
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2,      # CPU cores
    n_batch=512,      # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU. Should be a power of 2.
    n_gpu_layers=36,  # Change this value based on your model and your GPU VRAM pool.
    n_ctx=2048,       # Context window = maximum input sequence length (in tokens)
    n_gqa=8,          # Grouped-query attention heads (only required for some models, e.g., 70B variants)
)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
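As an optional sanity check (an illustrative addition, not part of the original cells), you can run a short throwaway completion to confirm the model loaded before moving on to the report-labeling prompts:

# Quick smoke test: a tiny completion just to confirm the model responds.
check = lcpp_llm("Q: Name one imaging modality used in radiology. A:", max_tokens=16)
print(check["choices"][0]["text"])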
Prompt Engineering
Prompt engineering has emerged as an important skill set for getting LLMs to execute your desired task. For this, you should know whether there is a prompt template specific to the model you are using. Our approach here:
- We start with a `system` prompt. This gives the LLM a role to play in the requests that follow.
- We implement a JSON `schema` to prompt the LLM to return structured labels for each report we submit.
- We provide a sample `report` for the LLM to analyze.
- We construct the `prompt` that will present the report text to the model, ask it to use the JSON schema provided, and analyze the report for the findings included in the schema.
- Finally, we utilize the `prompt templates` for the Llama-2-Chat and Mistral-7B-Instruct-v0.1 models to construct our complete prompt.
Note: Mistral-7B does not have a separate delimiter for the system role, so we pass that portion of the prompt with the remainder.
For more details on prompt engineering, see this guide: Prompt Engineering Guide (https://www.promptingguide.ai/)
# @title System prompt
# @markdown In your experimentation, you may change the text in the following field to see the effect the "system" prompt has on the model output.
system = "You are an expert radiologist's assistant, skilled in analyzing radiology reports. Please first provide a response to any specific requests. Then explain your reasoning." # @param {type: "string"}# @title Construct JSON schema
schema = '''
{
"cardiomegaly": { "type": "boolean" },
"lung_opacity": { "type": "boolean" },
"pneumothorax": { "type": "boolean" },
"pleural_effusion": { "type": "boolean" },
"pulmonary_edema": { "type": "boolean" },
"abnormal_study": { "type": "boolean" }
}
'''

# @title Provide a sample chest radiograph report
# @markdown A sample normal chest radiography report is provided for you here. If you would like to experiment, change the text in the field below and re-run this cell and the cells below.
report_text = "No focal consolidation, pneumothorax, or pleural effusion. Cardiomediastinal silhouette is stable and unremarkable. No acute osseous abnormalities are identified. No acute cardiopulmonary abnormality." # @param {type: "string"}# @title Construct User prompt
# @markdown I've included an additional instruction here to help the model understand that there is some overlap between lung opacity and other categories. As you may see below, this can actually confuse some models.
# @markdown <br><br>While some prompt engineering techniques can be helpful, you have to experiment to see what produces robust and consistent outputs.
#@markdown <br><br>You can delete the following text entirely if you do not want to provide additional instructions.
additional_instructions = "Note that 'lung_opacity' may include nodule, mass, atelectasis, or consolidation." # @param {type:"string"}
prompt = f'''
```{report_text}```
Please extract the findings from the preceding text radiology report using the following JSON schema:
```{schema}```
{additional_instructions}
'''

# @title Llama-2-Chat & Mistral-7B-Instruct-v0.1 prompt templates
# @markdown Using the correct prompt formatting with special tokens like `[INST]` can greatly improve your chances of getting a good response from an LLM. If you're unsure of the appropriate template, check the model card on Hugging Face, or the website or original paper for the model you're using.
llama2_prompt_template = f'''[INST] <<SYS>>
{system}
<</SYS>>
{prompt}[/INST]
'''
mistral_prompt_template = f'''<s>[INST] {system} {prompt} [/INST]'''

# @title Generate LLM response and print response text
if model == "llama-2":
full_prompt = llama2_prompt_template
else:
full_prompt = mistral_prompt_template
#@markdown After initial testing, consider experimenting with some of the hyperparameters below.
#@markdown - `max_tokens`: the maximum number of tokens the model may generate
#@markdown - `temperature`: controls the randomness of the output. Higher values produce more varied, human-like responses; `0` does not guarantee deterministic output.
#@markdown - `top_p`: nucleus sampling threshold; the model samples only from the smallest set of tokens whose cumulative probability exceeds `top_p`.
#@markdown <p>See the LLM settings guide linked below for more details on experimenting with hyperparameters.
max_tokens = 512 #@param {type:"integer"}
temperature = 0.5 #@param {type:"slider", min:0, max:1, step:0.1}
top_p = 0.95 #@param {type:"slider", min:0.8, max:1, step:0.05}
response = lcpp_llm(
    prompt=full_prompt,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    repeat_penalty=1.2,  # Penalize repeated tokens to reduce looping output
    top_k=50,            # Sample only from the 50 most likely next tokens
    # echo=True,         # Uncomment to return the prompt along with the completion
)
res_txt = response["choices"][0]["text"]
print(res_txt)

Of course! I'd be happy to help you analyze the radiology report. Here are my findings based on the JSON schema provided:
{
"cardiomegaly": false,
"lung_opacity": true,
"pneumothorax": false,
"pleural_effusion": false,
"pulmonary_rama": false,
"abnormal_study": true
}
Explanation:
The report states that there is no focal consolidation, pneumothorax, or pleural effusion. However, it does mention that the cardiomediastinal silhouette is stable and unremarkable, which suggests that there are no signs of cardiac tamponade or other abnormalities in this area. Additionally, the report states that no acute osseous abnormalities were identified, which means that there are no bone fractures or dislocations present. Finally, the report concludes that there is an abnormal study, which indicates that something unusual was detected during the imaging process.
I hope this helps! Let me know if you have any further questions.
Limitations of this Approach
- Errors: You may observe when using Llama-2-7B-Chat that the JSON returned is not ideal for what we requested, or may even contain an error like turning `pulmonary_edema` into `pulmonary_emia`.
  - This can be improved by simplifying your request for smaller models, or by using a model that is better trained for returning structured data in JSON format, like Mistral-7B.
  - Playing around with some of the model inference hyperparameters can also help. See this guide for further details: Prompt Engineering Guide: LLM Settings
- Hallucinations: LLMs can provide very confident answers that are flat out wrong. You may see output like "Under the 'lung_opacity' field, the report mentions that there is opacity in both lungs, which could indicate nodules, masses, atelectasis, or consolidation. Therefore, the value for this field is set to true.", even when there is no mention of that in the referenced report!
  - This can be improved by careful prompt engineering. You may want to include in your `system` prompt an instruction not to return an answer if the model is not confident. Or you may want to try without having the model explain its reasoning.
  - A group at NIH found that asking Vicuna-13B to perform a single labeling task at a time provided more robust results, as described in this article published in Radiology: Feasibility of Using the Privacy-preserving Large Language Model Vicuna for Labeling Radiology Reports (see the sketch after this list).
  - For certain use cases, retrieval-augmented generation (RAG) can be helpful. We'll cover that in the next notebook.
  - Finally, if all else fails and you have several hundred labeled examples of the task you want the LLM to perform, you may consider parameter-efficient fine-tuning (PEFT). See this guide from NVIDIA for more details: Selecting LLM Customization Techniques
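A hedged sketch of that "one label per call" strategy is shown below; the prompt wording and the simple yes/no parsing are illustrative assumptions, not code from the referenced article.

# Query the model once per finding instead of asking for the full JSON at once.
findings = ["cardiomegaly", "lung_opacity", "pneumothorax",
            "pleural_effusion", "pulmonary_edema", "abnormal_study"]

single_labels = {}
for finding in findings:
    q = f"[INST] Report: ```{report_text}``` Does this report indicate {finding}? Answer only yes or no. [/INST]"
    ans = lcpp_llm(prompt=q, max_tokens=8, temperature=0.0)["choices"][0]["text"]
    single_labels[finding] = "yes" in ans.lower()  # crude parsing, for illustration only

print(single_labels)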
# @title Define a function to postprocess the response text and extract the JSON object into a Python dict
def json_from_str(s):
    # Recursive pattern (requires the `regex` module imported above) that matches
    # balanced curly braces, so a complete JSON object can be pulled out of the response text.
    expr = re.compile(r'\{(?:[^{}]*|(?R))*\}')
    res = expr.findall(s)
    # Parse the first match; return None if no JSON object was found.
    return json.loads(res[0]) if res else None

# @title Assign an ID number to the report and associate extracted labels with the report ID
id = 1
labels = json_from_str(res_txt)
result_dict = {id: labels}
result_dict

{1: {'cardiomegaly': False,
  'lung_opacity': True,
  'pneumothorax': False,
  'pleural_effusion': False,
  'pulmonary_rama': False,
  'abnormal_study': True}}
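Since small models sometimes return malformed keys (note `pulmonary_rama` above), a simple check of the extracted labels against the schema keys can flag bad extractions. The helper below is an illustrative addition, not part of the original module:

# Flag extractions whose keys or value types do not match the schema.
expected_keys = set(json.loads(schema).keys())

def validate_labels(extracted):
    """Return a list of problems found in an extracted label dict (empty list means OK)."""
    if extracted is None:
        return ["no JSON object found in the response"]
    problems = [f"unexpected key: {k}" for k in extracted.keys() - expected_keys]
    problems += [f"missing key: {k}" for k in expected_keys - extracted.keys()]
    problems += [f"non-boolean value for '{k}'" for k, v in extracted.items() if not isinstance(v, bool)]
    return problems

print(validate_labels(labels))  # here this would flag 'pulmonary_rama' and the missing 'pulmonary_edema'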