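Runs on 6 GB of VRAM! How to run Llama 2 7B on Windows with AutoGPTQ

This post walks through, with screenshots, how to run a model from Meta's Llama 2 family of LLMs (large language models), which some say may surpass ChatGPT (GPT-3.5), on a consumer graphics card.

We will use a 4-bit quantized 7B model that runs in only about 5.3 GB of VRAM.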

0, Overview of the model
1, Installing the prerequisites
2, Generating text with Llama 2
3, Llama 2 7B output
- Asking for an explanation
- Asking about programming
4, Hardware load during generation (VRAM usage)
Summary: Llama-2-7b with AutoGPTQ

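0, Overview of the model

Counting converted variants, there are countless Llama 2 models; this post uses the quantized model published at TheBloke/Llama-2-7b-Chat-GPTQ.

For now, think of quantization as a process that reduces model size and VRAM usage at a slight cost in accuracy.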
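
As a rough illustration of why 4-bit quantization matters, the sketch below estimates weight storage for a 7-billion-parameter model at two bit widths. The function name and the round figure of 7e9 parameters are assumptions made for this example. The estimate ignores quantization metadata and any layers kept at higher precision, so real files are somewhat larger.

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # assumed round figure for a "7b" model

fp16_gb = estimate_weight_gb(N_PARAMS, 16)  # weights stored as 16-bit floats
int4_gb = estimate_weight_gb(N_PARAMS, 4)   # weights stored as 4-bit integers

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

The 4-bit figure is a quarter of the 16-bit one, which is what brings a 7B model within reach of a 6 GB card.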

From the model list in that repository, we use "gptq-4bit-128g-actorder_True", the variant with the lowest VRAM usage.

Branch	Bits	Group Size	Act Order (desc_act)	File Size	Description
main	4	128	False	3.90 GB	Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options.
gptq-4bit-32g-actorder_True	4	32	True	4.28 GB	4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed.
gptq-4bit-64g-actorder_True	4	64	True	4.02 GB	4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed.
gptq-4bit-128g-actorder_True	4	128	True	3.90 GB	4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed.
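
To give a feel for the "Group Size" column, here is a toy sketch of group-wise 4-bit quantization. It uses plain round-to-nearest with one scale and offset per group, which is not the GPTQ algorithm itself; all function names and the sample weights are made up for illustration. It only shows the trade-off the table describes: smaller groups track the weights more closely but need more per-group metadata.

```python
def quantize_group(values, bits=4):
    """Round-to-nearest quantization of one group with a shared scale and offset."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + lo for c in codes]


def quantize_roundtrip(weights, group_size, bits=4):
    """Quantize then dequantize the weights, one group at a time."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        restored.extend(dequantize_group(*quantize_group(group, bits)))
    return restored


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy weights: mostly small values plus a single outlier at the end.
weights = [0.01 * i for i in range(8)] + [0.02, 0.03, 0.01, 0.04, 0.02, 0.03, 0.01, 1.0]

err_g16 = mean_abs_error(weights, quantize_roundtrip(weights, group_size=16))
err_g8 = mean_abs_error(weights, quantize_roundtrip(weights, group_size=8))
print(f"group size 16: {err_g16:.4f}, group size 8: {err_g8:.4f}")
```

With one group of 16, the outlier stretches the scale for every weight; splitting into two groups of 8 confines that effect to the group that contains it.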

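1, Installing the prerequisites

Install beforehand

Python (tested here with 3.10.7)

Install it and make sure it is on your PATH.

Creating a virtual environment

Create a folder anywhere you like, right-click inside it, and choose "Open in Terminal".

Build a Python virtual environment in that folder.

Copy, paste, and run the following commands.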

python -m venv venv
venv\Scripts\activate.ps1

If a warning appears when you paste, ignore it and paste anyway.

The virtual environment is now created.

Installing AutoGPTQ and Transformers

pip install auto-gptq
pip install transformers
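
If you want to confirm that the installs landed in the virtual environment, a small standard-library check like the one below may help. The helper names are hypothetical; the only assumption about the packages is that auto-gptq is imported as auto_gptq and transformers as transformers, matching the import lines used in this post.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """True when Python runs inside a venv (its prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("missing modules:", missing_modules(["auto_gptq", "transformers"]))
```

An empty list for the missing modules means both packages are importable from the interpreter you just ran.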

Keep the terminal open; we will use it again.

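2, Generating text with Llama 2

Copy-paste-ready code: generating text from the command line

Let's just get it running first.

Copy the code below and save it under any name you like, such as test.py.
The extension must be .py.
Note that it is a lightly modified version of existing sample code.

Code to copy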

from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-7b-Chat-GPTQ"
model_basename = "gptq_model-4bit-128g"
use_triton = False

# Load the tokenizer and the 4-bit quantized weights onto the first GPU.
# No revision is passed here, so the files come from the repository's default branch.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=None,
)

# Build the generation pipeline once and reuse it for every prompt.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

# Read a prompt, wrap it in the chat template, and print the generated text.
while True:
    prompt = input("Enter a prompt: ")
    prompt_template = f"""[INST] <<SYS>>
    You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
    <</SYS>>
    {prompt}[/INST]"""
    print(pipe(prompt_template)[0]["generated_text"])

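Explaining parts of the code

Specifying the AI's role

<<SYS>>text that defines the AI's role<</SYS>>
This is where you can specify its role, personality, and so on.

If you already know what you will ask, you could write something like "as an expert in xx, ...", and you may even be able to specify a speaking style.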

<<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

In short, it tells the model to be helpful, respectful, and honest, to avoid harmful or biased content, to explain why a question does not make sense rather than answer it incorrectly, and not to share false information when it does not know the answer.
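
To make it easier to swap in a different role, the template can be wrapped in a small helper. The sketch below follows the [INST] / <<SYS>> layout used in the code above, minus the leading indentation that the triple-quoted string carries there. The function name, the shortened default system message, and the example role are all assumptions for illustration.

```python
DEFAULT_SYSTEM = "You are a helpful, respectful and honest assistant."


def build_prompt(user_message: str, system_message: str = DEFAULT_SYSTEM) -> str:
    """Wrap a system message and a user message in the [INST] / <<SYS>> layout."""
    return (
        "[INST] <<SYS>>\n"
        f"{system_message}\n"
        "<</SYS>>\n"
        f"{user_message}[/INST]"
    )


# Example: give the model a narrower role before asking the question.
print(build_prompt(
    "Explain VRAM in one sentence.",
    system_message="You are a GPU hardware expert. Answer briefly.",
))
```

The returned string can be passed to the pipeline in place of prompt_template.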

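Specifying the length of the generated text

max_new_tokens=512,

Text is split into tokens, so 512 tokens does not mean 512 characters.
With this setting, the output was sometimes around 1,500 characters of English text.

Japanese output comes out somewhat shorter, but the exact value of max_new_tokens does not matter much.
As the "max" in its name suggests, it only sets an upper limit on the number of output tokens, and generation often finishes well before reaching it.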
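
As a back-of-the-envelope check on the figures above, the snippet below divides the roughly 1,500 characters observed by the 512-token limit. The function name is made up for this example. Because 512 is only an upper bound on the tokens actually generated, the result is a lower bound on characters per token for that output, not a property of the tokenizer.

```python
def chars_per_token(n_chars: int, n_tokens: int) -> float:
    """Average number of characters covered by one token."""
    return n_chars / n_tokens


# About 1,500 English characters came out of at most 512 new tokens.
ratio = chars_per_token(1500, 512)
print(f"at least about {ratio:.1f} characters per token")
```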

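Running it

Now run it.
Paste the following into the terminal you left open.

python test.py

On the first run, the model (about 4 GB) is downloaded.

Then type any question and press Enter, and it will answer.

Note: various messages appear in the console, but you can ignore them.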
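
As far as I know, the Transformers text-generation pipeline returns the prompt followed by the completion by default, so the printed text may begin with the whole prompt template. The helper below is a minimal sketch for keeping only the newly generated part; its name and the sample strings are assumptions for illustration.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the newly generated part if the text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text.strip()


# Stand-in strings; in the script these would be prompt_template and the pipeline output.
prompt = "[INST] Say hello.[/INST]"
full_text = prompt + " Hello! How can I help you today?"
print(strip_prompt(full_text, prompt))
```

Passing return_full_text=False when calling the pipeline should have a similar effect, if your Transformers version supports it.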

3, Llama 2 7B output

Let's look at the text generated by the code above.

Asking for an explanation

The question is "tell me about AI concisely".

The response is below.

Of course! I’d be happy to provide you with some information on AI in a safe and responsible manner. Here’s a brief overview:
Artificial intelligence (AI) refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI has been rapidly advancing in recent years due to significant breakthroughs in machine learning and deep learning technologies. These advancements have enabled AI systems to learn from large datasets and improve their performance over time, leading to applications in various industries such as healthcare, finance, transportation, and entertainment.
Some examples of AI applications include:

Chatbots: Many companies use chatbots to provide customer support and answer frequently asked questions. These bots use natural language processing (NLP) to understand and respond to user inputs.

Virtual assistants: Virtual assistants like Siri, Alexa, and Google Assistant use AI to perform tasks such as setting reminders, sending messages, and making phone calls.

Fraud detection: AI algorithms can analyze financial transactions to detect fraudulent activity and prevent financial losses.

Image recognition: AI can recognize images and classify them into different categories, such as objects, people, and scenes. This technology is used in applications such as facial recognition, self-driving cars, and medical imaging.

Predictive maintenance: AI can predict when equipment or machinery is likely to fail, allowing for proactive maintenance and reducing downtime.

Autonomous vehicles: Self-driving cars use AI to navigate roads and avoid obstacles, improving safety and reducing traffic congestion.

Personalized recommendations: AI algorithms can analyze user data to provide personalized recommendations for products, services, and content.

Healthcare diagnosis: AI can analyze medical images and patient data to help doctors diagnose diseases and develop treatment plans.

Language translation: AI can translate languages in real-time, breaking down communication barriers and facilitating international collaboration.

Cybersecurity: AI-powered systems can detect and prevent cyber attacks by analyzing network traffic and identifying potential threats.
These are just a few examples

It is not concise, but the content itself is perfectly sound.

Asking about programming

The question is "tokenizers>=0.13.3 is required for a normal functioning of this module, but found tokenizers==0.13.1. How can I solve this error" (a pasted error message followed by a request for how to fix it).

The response is below.

Thank you for reaching out! I’m here to help you with your question. However, I must inform you that the version of Tokenizers you have installed (tokenizers==0.13.1) is not compatible with the version of the module you are trying to use (module requires tokenizers>=0.13.3).
To resolve this issue, you will need to update the Tokenizers package to version 0.13.3 or higher. Here are the steps you can follow:

Open your terminal or command prompt and run the following command: pip install --upgrade tokenizers
This will upgrade the Tokenizers package to the latest version available in the PyPI repository.

Once the upgrade is complete, try running the code again to see if it resolves the issue.
I hope this helps! Let me know if you have any other questions or concerns.

This is also correct. It is helpful that it presents the command in a form that is easy to copy and paste!

Let me ask one more.

The question is "how to print “Hello” 10 times ,using python".

The answer was as follows.

Of course! I’d be happy to help you with your Python question. To print “Hello” 10 times using Python, you can use a for loop like this:
# Print Hello 10 times
for i in range(10):
   print("Hello")
This will print “Hello” 10 times, once for each iteration of the loop. I hope this helps! Let me know if you have any other questions.

The program ran correctly and did what was intended.