
Training GPT-2 To Generate Haiku

March 15, 2023


Purpose

In this article, I will show how to train GPT-class neural networks from home with little effort. Let me start by saying that we won’t train a network from scratch, as that would require at least eight (8!) A100-class GPUs and a massive dataset. Instead, we’ll focus on fine-tuning a pre-trained GPT-2 model using a smaller dataset, which anyone can easily make or find online. OpenAI has kindly released GPT-2 under a Modified MIT License.

nanoGPT

We’ll use the nanoGPT repository created by Andrej Karpathy for fast and easy GPT training. He has a comprehensive video lecture explaining how GPT-2 works and how to train such a neural network. Here, however, we’re interested in fine-tuning the model on our own dataset and seeing how the result differs from the original GPT-2 trained by OpenAI.

Choosing a Dataset

I decided to try teaching GPT-2 some poetry with a clearly identifiable style. Poetry often has a simple, compact structure, and GPT-2 can demonstrate good results in that domain. I chose to focus on haiku, a form of Japanese poetry in which each verse consists of three short lines (traditionally of 5, 7, and 5 syllables).

An old silent pond

A frog jumps into the pond —

Splash! Silence again. — “The Old Pond” by Matsuo Bashō

Haiku Dataset Preparation

I found a dataset of haiku collections on Kaggle, released under a CC BY 4.0 license (link to the source). The dataset is not very big: ~340k tokens (~20 MB). To adapt it to my needs, I cleaned it up by removing stray spaces, dashes, and other punctuation that, in my opinion, was unnecessary. In the end, each line of the resulting file holds one haiku, with its three lines separated by semicolons (;), written out to an input.txt file.

fishing boats;colors of;the rainbow

import csv
import re

with open('all_haiku.csv', 'r') as csv_file, open('input.txt', 'w') as txt_file:
    reader = csv.reader(csv_file)
    i = 0
    for row in reader:
        strings = row[1:4]  # columns 1-3 hold the three lines of a haiku
        upd_strings = []
        for s in strings:
            # strip surrounding spaces, then keep only letters and spaces
            s = s.strip()
            upd_strings.append(re.sub(r"[^a-zA-Z ]+", "", s))
        # skip the header row (i == 0) that holds the column labels
        if i > 0:
            txt_file.write(';'.join(upd_strings) + '\n')
        i += 1

    print("Added", i - 1, "strings of text")

After that, I used the prepare.py file from data/shakespeare to convert input.txt into two files, train.bin and val.bin. The first is used for training and the second for validation, as the names imply. The split is 90/10, but the ratio can always be adjusted (look for the 0.9 value in the file).
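
For reference, here is roughly what that script does; a minimal sketch of nanoGPT’s data/shakespeare/prepare.py, assuming input.txt sits in the same directory:

import numpy as np
import tiktoken

with open('input.txt', 'r') as f:
    data = f.read()

# the 0.9 below is the 90/10 train/validation split mentioned above
n = len(data)
train_data = data[:int(n * 0.9)]
val_data = data[int(n * 0.9):]

# encode with the GPT-2 BPE tokenizer and export as uint16 binaries
enc = tiktoken.get_encoding("gpt2")
np.array(enc.encode_ordinary(train_data), dtype=np.uint16).tofile('train.bin')
np.array(enc.encode_ordinary(val_data), dtype=np.uint16).tofile('val.bin')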

Data Preparation and Fine-Tuning

Next, I took the finetune_shakespeare.py file as a basis and modified it accordingly:

import time

out_dir = 'out-haiku'
eval_interval = 5
eval_iters = 40
wandb_log = True # log metrics to Weights & Biases
wandb_project = 'haiku'
wandb_run_name = 'test3'

dataset = 'haiku'
init_from = 'gpt2-large' # 'gpt2-xl' is the largest GPT-2 model, but it won't fit in 24GB (see below)

# only save checkpoints if the validation loss improves
always_save_checkpoint = False

# the number of examples per iter:
# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
# haiku has 2,301,966 tokens, so 1 epoch ~= 70 iters (max_iters = 1000 is ~14 epochs)
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 1000

# finetune with a small LR: warmup, then decay down to min_lr
learning_rate = 1e-6
decay_lr = True
warmup_iters = 200 # = max_iters / 5
lr_decay_iters = max_iters
min_lr = learning_rate / 10

compile = False # torch.compile requires PyTorch 2.0
I selected the learning rate of 1e-6 through experimentation. I also recommend using wandb, which gives a much better view of your experiments. Since I wasn’t using the new PyTorch 2.0, I added compile = False. I chose the ‘gpt2-large’ (774M parameters) network because I was using an RTX 3090 with 24GB of video memory, and ‘gpt2-xl’ (1.5B) wouldn’t fit. You can use a different pre-trained GPT-2 model depending on your hardware; as a rule, the larger the network, the better the results.
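
To double-check the epoch math in the config comments, here is the arithmetic spelled out in plain Python (block_size = 1024 is nanoGPT’s default context length for GPT-2):

batch_size = 1
gradient_accumulation_steps = 32
block_size = 1024  # tokens per training example

tokens_per_iter = batch_size * gradient_accumulation_steps * block_size
print(tokens_per_iter)                          # 32768 tokens per iteration

dataset_tokens = 2_301_966                      # haiku dataset size, from the comment above
print(dataset_tokens / tokens_per_iter)         # ~70 iterations per epoch
print(1000 * tokens_per_iter / dataset_tokens)  # max_iters = 1000 is ~14 epochs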

[Figures: “Test Results” wandb charts showing training and validation loss over 1,000 iterations]
After 1,000 iterations, the validation loss had already stabilized, so we can consider the training complete. Training on my RTX 3090 took ~170 minutes.
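
For reference, that is 1,000 iterations × 32,768 tokens per iteration ≈ 32.8M tokens in ~170 minutes, or roughly 3,200 tokens per second.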

Testing

To test, I slightly modified the sample.py script so that it generates responses based on a prompt I provide.


"""
Sample from a trained model
"""
import os
import pickle
from contextlib import nullcontext
import torch
import tiktoken
from model import GPTConfig, GPT

prompt = "Full Moon is shining\n"

# -----------------------------------------------------------------------------
init_from = 'resume' # either 'resume' (from an out_dir) or a gpt2 variant (e.g. 'gpt2-xl')
out_dir = 'out-haiku_1k' # ignored if init_from is not 'resume'
start = "\n" # or "<|endoftext|>" or etc. Can also specify a file, use as: "FILE:prompt.txt"
num_samples = 10 # number of samples to draw
max_new_tokens = 20 # number of tokens generated in each sample
temperature = 0.9 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
seed = 1381
device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
dtype = 'bfloat16' # 'float32' or 'bfloat16' or 'float16'
compile = False # use PyTorch 2.0 to compile the model to be faster
exec(open('configurator.py').read()) # overrides from command line or config file
# -----------------------------------------------------------------------------

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

# model
if init_from == 'resume':
    # init from a model saved in a specific directory
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    gptconf = GPTConfig(**checkpoint['model_args'])
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)
elif init_from.startswith('gpt2'):
    # init from a given GPT-2 model
    model = GPT.from_pretrained(init_from, dict(dropout=0.0))

model.eval()
model.to(device)
if compile:
    model = torch.compile(model) # requires PyTorch 2.0 (optional)

# look for the meta pickle in case it is available in the dataset folder
load_meta = False
if init_from == 'resume' and 'config' in checkpoint and 'dataset' in checkpoint['config']: # older checkpoints might not have these...
    meta_path = os.path.join('data', checkpoint['config']['dataset'], 'meta.pkl')
    load_meta = os.path.exists(meta_path)
if load_meta:
    print(f"Loading meta from {meta_path}...")
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    # TODO want to make this more general to arbitrary encoder/decoder schemes
    stoi, itos = meta['stoi'], meta['itos']
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda l: ''.join([itos[i] for i in l])
else:
    # ok let's assume gpt-2 encodings by default
    print("No meta.pkl found, assuming GPT-2 encodings...")
    enc = tiktoken.get_encoding("gpt2")
    encode = lambda s: enc.encode(s, allowed_special={"<|endoftext|>"})
    decode = lambda l: enc.decode(l)

# encode the beginning of the prompt
#if start.startswith('FILE:'):
#    with open(start[5:], 'r', encoding='utf-8') as f:
#        start = f.read()
start_ids = encode(prompt)
x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])

print('---------------')
# run generation
with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('---------------')
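
For intuition, here is roughly what the temperature and top_k parameters do to each next-token distribution. This is a minimal sketch of the standard sampling step (close to what nanoGPT’s model.generate does internally, but not its exact code):

import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.9, top_k: int = 200) -> int:
    """Pick the next token id from a 1-D [vocab_size] logits vector."""
    logits = logits / temperature           # <1.0 sharpens, >1.0 flattens the distribution
    v, _ = torch.topk(logits, top_k)
    logits[logits < v[-1]] = -float('inf')  # clamp everything outside the top_k to probability 0
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()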

During text generation, I provided the initial text “Full Moon is shining” but did not stop the model at three lines, although I could have (a small helper for that is sketched at the end of this section); interesting verses were often longer than three lines. For comparison, I will first show the output of the original GPT-2-large on this prompt, and then what I was able to generate after fine-tuning.

Original GPT-2-large output:

Full Moon is shining

At the end of the month

A beautiful woman is looking

at her face in (the mirror)

And one more:

Full Moon is shining

And it’s getting hard to say goodbye

To the days that were good and true

GPT-2-large fine-tuned on haiku:

Full moon is shining

Like the blood, like the flames

To save you from the heart

I pray the fallen

And one more:

Full moon is shining

and when i see it

your eyes can’t hide my tears

one more time i will come

It can be observed that the content of the poems changes significantly: after fine-tuning, the model produces more poetic expressions and more allegory. One could even argue that there is a deeper meaning behind the words, as there should be in a haiku.
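
As promised above, if you did want to cut each sample down to a single three-line haiku, a small post-processing helper would do it. This is my own sketch, assuming the ;-separated line format used in training; it is not part of nanoGPT:

def first_haiku(generated: str) -> str:
    """Keep only the first three line segments of a generated sample."""
    parts = generated.replace('\n', ';').split(';')
    lines = [p.strip() for p in parts if p.strip()][:3]
    return '\n'.join(lines)

print(first_haiku("fishing boats;colors of;the rainbow;and more text"))
# fishing boats
# colors of
# the rainbow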

Conclusions

GPT-2 is great for generating poetry; here we can truly see the potential of this technology. Today there are models like GPT-3, which are much larger and have more “general” knowledge, but GPT-2 can be trained on a specific narrow task quickly and, as in my case, at home. Tools like the nanoGPT repository make such training fast and easy.

The scope for experimentation is limited only by your imagination, since whatever you feed the neural network is encoded as a sequence of numbers anyway. In theory, one can train on more than just text examples, but that is a topic beyond the scope of this article.

Here is my fork of the nanoGPT repository.


