NLP (26): How to Fine-Tune GPT-2 for Text Generation

Abstract:
In recent years, natural language generation (NLG) has made incredible progress. In early 2019, OpenAI released GPT-2, a huge pretrained model (1.5B parameters) capable of generating text of near-human quality. As its name suggests, the Generative Pretrained Transformer 2 (GPT-2) is based on the Transformer architecture. It therefore uses an attention mechanism, which means it learns to focus on the previous words that are most relevant to the context in order to predict the next word. The purpose of this article is to show you how to fine-tune GPT-2 on data you provide, so that it generates contextually relevant text. As an example, I will generate song lyrics: the idea is to take the already-trained model, fine-tune it on our specific data, and then, given the beginning of any song, generate what should follow based on what the model has observed.

1. Preparing the Data

GPT-2 can already generate decent-quality text on its own. However, if you want it to do better in a specific context, you need to fine-tune it on your specific data. In my case, since I want to generate song lyrics, I will use the following Kaggle dataset (https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres), which contains a total of 12,500 pop and rock song lyrics, all in English.

Data preview: artists-data.csv

Artist,Songs,Popularity,Link,Genre,Genres
10000 Maniacs,110,0.3,/10000-maniacs/,Rock,Rock; Pop; Electronica; Dance; J-Pop/J-Rock; Gospel/Religioso; Infantil; Emocore
12 Stones,75,0.3,/12-stones/,Rock,Rock; Gospel/Religioso; Hard Rock; Grunge; Rock Alternativo; Hardcore; Punk Rock; Chillout; Electronica; Heavy Metal; Metal; World Music; Axé; Emocore
311,196,0.5,/311/,Rock,Rock; Surf Music; Reggae; Ska; Pop/Rock; Rock Alternativo; Hardcore
4 Non Blondes,15,7.5,/4-non-blondes/,Rock,Rock; Pop/Rock; Rock Alternativo; Grunge; Blues; Pop; Soft Rock; Power-Pop; Piano Rock; Indie; Chillout
A Cruz Está Vazia,13,0,/a-cruz-esta-vazia/,Rock,Rock
Aborto Elétrico,36,0.1,/aborto-eletrico/,Rock,Rock; Punk Rock; Pós-Punk; Post-Rock
Abril,36,0.1,/abril/,Rock,Rock; Emocore; Hardcore; Pop/Rock; Rock Alternativo; Romântico; Hard Rock; Blues; World Music
Abuse,13,0,/abuse/,Rock,Rock; Hardcore
AC/DC,192,10.8,/ac-dc/,Rock,Rock; Heavy Metal; Classic Rock; Hard Rock; Clássico; Metal; Punk Rock; Blues; Black Music; Rockabilly; Psicodelia; Funk Carioca; Rock Alternativo; Trilha Sonora; New Age; Hip Hop; New Wave; Sertanejo; Post-Rock; Pop/Rock; MPB; Electronica; Grunge; Progressivo; Pop/Punk; Funk; Forró
ACEIA,0,0,/aceia/,Rock,Rock
Acid Tree,5,0,/acid-tree/,Rock,Rock; Heavy Metal; Metal
Adam Lambert,110,1.4,/adam-lambert/,Pop,Pop; Pop/Rock; Rock; Romântico; Dance; Electronica; Emocore; Power-Pop; Axé; Gótico; R&B; Punk Rock; Pop/Punk; Black Music; Rock Alternativo; World Music; J-Pop/J-Rock; Gospel/Religioso; Hip Hop; K-Pop/K-Rock; Piano Rock; Heavy Metal; Velha Guarda; Soul Music; Hard Rock; Country; Soft Rock; Tecnopop; House; Trilha Sonora; Blues
Adrian Suirady,7,0,/adrian-suirady/,Rock,Rock; Gótico
Aerosmith,249,16.5,/aerosmith/,Rock,Rock; Hard Rock; Heavy Metal; Romântico; Pop/Rock; Classic Rock; Rock Alternativo; Blues; Metal; Chillout; Piano Rock; Funk; Gótico; Forró; Jovem Guarda; Hip Hop
Aliados,75,0.8,/aliados/,Rock,Rock; Pop/Rock; Rock Alternativo; Surf Music; Hardcore; Pop/Punk; Blues; R&B; Punk Rock; Axé
Alice Cooper,310,1.2,/alice-cooper/,Rock,Rock; Hard Rock; Heavy Metal; Punk Rock; Classic Rock; Grunge; Trilha Sonora; Gótico
Alter Bridge,74,1.4,/alter-bridge/,Rock,Rock; Hard Rock; Rock Alternativo; Heavy Metal; Grunge; Romântico; Rap; Metal; Hardcore
Amy Lee,33,0.5,/amy-lee/,Rock,Rock; Gótico; Hard Rock; Rock Alternativo; Heavy Metal; Piano Rock; Romântico; Metal; Indie; Classic Rock; New Age; Funk; Electronica; Industrial; Post-Rock; Psicodelia; Funk Carioca; Infantil; Pós-Punk; Dance; Pop; Clássico; Axé; Trilha Sonora
Anberlin,98,0.1,/anberlin/,Rock,Rock; Rock Alternativo; Hardcore; Emocore; Gospel/Religioso
Andi Deris,44,0,/andi-deris/,Rock,Rock; Hard Rock; Heavy Metal
Andrew W.K.,31,0,/andrew-w-k/,Rock,Rock
Andy (Brasil),7,0,/andy-brasil/,Rock,Rock
Angra,124,2.2,/angra/,Rock,Rock; Heavy Metal; Hard Rock; Progressivo; Metal; Black Music; Piano Rock; Post-Rock; Romântico; Psicodelia; Hardcore; Clássico; Forró; Pagode
Arthur Brown,2,0,/arthur-brown/,Rock,Rock
Asking Alexandria,77,1,/asking-alexandria/,Rock,Rock; Hard Rock; Hardcore; Heavy Metal; Emocore; Metal; Rock Alternativo; K-Pop/K-Rock; Classic Rock; Samba; Tecnopop; Grunge; Reggae; Chillout; World Music; Pop/Rock; Black Music; Gótico; Punk Rock; New Age
Autoramas,67,0.1,/autoramas/,Rock,Rock; Pop/Rock; Rock Alternativo; Progressivo; Indie; Punk Rock; Hardcore; Surf Music; Electronica; Funk; Pagode; Ska; R&B; Samba; New Age; MPB; Axé; Funk Carioca; Emocore; Grunge
Avante,21,0,/avante/,Rock,Rock

Data preview: lyrics-data.csv

ALink,SName,SLink,Lyric,Idiom
/10000-maniacs/,More Than This,/10000-maniacs/more-than-this.html,I could feel at the time. There was no way of knowing. Fallen leaves in the night. Who can say where they're blowing. As free as the wind. Hopefully learning. Why the sea on the tide. Has no way of turning. More than this. You know there's nothing. More than this. Tell me one thing. More than this. You know there's nothing. It was fun for a while. There was no way of knowing. Like a dream in the night. Who can say where we're going. No care in the world. Maybe I'm learning. Why the sea on the tide. Has no way of turning. More than this. You know there's nothing. More than this. Tell me one thing. More than this. You know there's nothing. More than this. You know there's nothing. More than this. Tell me one thing. More than this. There's nothing.,ENGLISH
/10000-maniacs/,Because The Night,/10000-maniacs/because-the-night.html,"Take me now, baby, here as I am. Hold me close, and try and understand. Desire is hunger is the fire I breathe. Love is a banquet on which we feed. Come on now, try and understand. The way I feel under your command. Take my hand, as the sun descends. They can't hurt you now can't hurt you now, can't hurt you now. Because the night belongs to lovers. Because the night belongs to us. Because the night belongs to lovers. Cause the night belongs to us. Have I a doubt, baby, when I'm alone. Love is a ring a telephone. Love is an angel, disguised as lust. Here in our bed 'til the morning comes. Come on now, try and understand. The way I feel under your command. Take my hand, as the sun descends. They can't hurt you now, can't hurt you now, can't hurt you now. Because the night belongs to lovers. Because the night belongs to us. Because the night belongs to lovers. Because the night belongs to us. With love we sleep,. with doubt the vicious circle turns, and burns. Without you, oh I cannot live,. forgive the yearning burning. I believe it's time to heal to feel,. so take me now, take me now, take me now. Because the night belongs to lovers. Because the night belongs to us. Because the night belongs to lovers. Because the night belongs to us",ENGLISH
/10000-maniacs/,These Are Days,/10000-maniacs/these-are-days.html,"These are. These are days you'll remember. Never before and never since, I promise. Will the whole world be warm as this. And as you feel it,. You'll know it's true. That you - you are blessed and lucky. It's true - that you. Are touched by something. That will grow and bloom in you. These are days you'll remember. When May is rushing over you. With desire to be part of the miracles. You see in every hour. You'll know it's true. That you are blessed and lucky. It's true that you are touched. By something that will grow and bloom in you. These are days. These are the days you might fill. With laughter until you break. These days you might feel. A shaft of light. Make its way across your face. And when you do. You'll know how it was meant to be. See the signs and know their meaning. You'll know how it was meant to be. Hear the signs and know they're speaking. To you, to you",ENGLISH
/10000-maniacs/,A Campfire Song,/10000-maniacs/a-campfire-song.html,"A lie to say, ""O my mountain has coal veins and beds to dig.. 500 men with axes and they all dig for me.""A lie to ssay, ""O my. river where mant fish do swim, half of the catch is mine when you haul. your nets in.""Never will he believe that his greed is a blinding. ray. No devil or redeemer will cheat him. He'll take his gold to. where he's lying cold.. A lie to say, ""O my mine gave a diamond as big as a fist."". But with every gem in his pocket, the jewels he has missed. A lie to. say, ""O my garden is growing taller by the day.""He only eats the. best and tosses the rest away. Never will he be believe that his. greed is a blinding ray. No devil or redeemer can cheat him. he'll. take his gold to where he's lying cold. Six deep in the grave.. Something is out of reach. something he wanted. something is out of reach. he's being taunted. something is out of reach. that he can' beg or steal nor can he buy. his oldest pain. and fear in life. there'll not be time. his oldest pain. and fear in life. there'll not be time. A lie to say ""O my forest has trees that block the sun and. when I cut them down I don't answer to anyone.""No, no, never will he. believe that his greed is a blinding ray no devil or redeemer can. cheat. him. He'll take his gold where he's lying cold..",ENGLISH
/10000-maniacs/,Everyday Is Like Sunday,/10000-maniacs/everyday-is-like-sunday.html,"Trudging slowly over wet sand. Back to the bench where your clothes were stolen. This is a coastal town. That they forgot to close down. Armagedon - come armagedon come armagedon come. Everyday is like sunday. Everyday is silent and grey. Hide on a promanade. Etch on a post card:. How I dearly wish I was not here. In the seaside town. That they forgot to bomb. Come, come nuclear bomb!. Everyday is like sunday. Everyday is silent and grey. Trudging back over pebbles and sand. And a strange dust lands on your hands. (and on your face). Everyday is like sunday. Win yourself a cheap tray. Share some grease tea with me. Everyday is silent and grey",ENGLISH
/10000-maniacs/,Don't Talk,/10000-maniacs/dont-talk.html,"Don't talk, I will listen. Don't talk, you keep your distance. For I'd rather hear some truth tonight. Than entertain your lies,. So take you poison silently. Let me be let me close my eyes. Don't talk, I'll believe it. Don't talk, listen to me instead,. I know that if you think of it,. Both long enough and hard. The drink you drown your troubles. In is the trouble you're in now. Talk talk talk about it,. If you talk as if you care. But when your talk is over. Tilt that bottle in the air,. Tossing back more than your share. Don't talk, I can guess it. Don't talk, well now your restless. And you need somewhere to put the blame. For how you feel inside. You'll look for a close. And easy mark and you'll see me as fair game. Talk talk talk about it,. Talk as if you care. But when your talk is over tilt. That bottle in the air. Tossing back more than your share. You talk talk talk about it,. You talk as if you care. I'm marking every word. And can tell this time for sure,. Your talk is the finest I have heard. So don't talk, I'll be sleeping,. Let me go on dreaming. How your eyes they glow so fiercely. I can tell your inspired. By the name you just chose for me. Now what was it?. O, never mind it. We will talk talk. Talk about this when your head is clear. I'll discuss this in the morning,. But until then you may talk but I won't hear",ENGLISH
/10000-maniacs/,Across The Fields,/10000-maniacs/across-the-fields.html,"Well they left then in the morning, a hundred pairs of wings. In the light moved together in the colors of the morning. I looked to the clouds in the cirrus sky and they'd gone.. Across the marshes, across the fields below.. I fell through the vines and I hoped they would catch me below.. If only to take me with them there,. Tell me the part that shinesIn your heart on the wind.. And the reeds blew in the morning.. Take me along to the places. You've gone when my eyes looked away.. Tell me the song that you sing in the trees in the dawning.. Tell me the part that shines in your heart. And the rays of love forever,. Please take me there..",ENGLISH
/10000-maniacs/,Planned Obsolescence,/10000-maniacs/planned-obsolescence.html,[ music: Dennis Drew/lyric: Natalie Merchant ]. . science. is truth for life. watch religion fall obsolete. science. will be truth for life. technology as nature. science. truth for life. in fortran tongue the. answer. with wealth and prominence. man so near perfection. possession. it's an absence of interim. secure no demurrer. defense against divine. defense against his true. image. human conflict number five. discovery. dissolved all illusion. mystery. destroyed with conclusion. and illusion never restored. any modern man can see. that religion is. obsolete. piety. obsolete. ritual. obsolete. martyrdom. obsolete. prophetic vision. obsolete. mysticism. obsolete. commitment. obsolete. sacrament. obsolete. revelation. obsolete.,ENGLISH
/10000-maniacs/,Rainy Day,/10000-maniacs/rainy-day.html,"On bended kneeI've looked through every window then.. Touched the bottom, the night a sleepless day instead. A day when love came,came easy like what's lost now found.. Beneath a blinding light that would surround.. We were without, in doubt. We were about saving for a rainy day.. I crashed through mirrors,. I crashed through floors of laughter then.. In a blind scene, no ties would moor us to this room.. A day when love came, came easy like what's lost now found.. And you would save me, and I held you like you were my child.. If I were you, defiant you, alone upon a troubled way.. I would send my heart to you. To save it for a rainy day..",ENGLISH
/10000-maniacs/,Anthem For Doomed Youth,/10000-maniacs/anthem-for-doomed-youth.html,For whom do the bells toll. When sentenced to die. The stuttering rifles. Will stifle the cry. The monstrous anger. The fear's rapid rattle. A desert inferno. Kids dying like cattle. Don't tell me. We're not prepared. I've seen today's marine. He's eighteen and he's eager. He can be quite mean. No mock'ries for them. No prayers or bells. The demented choirs. The wailing of shells. The boys holding candles. On untraveled roads. The fear spreads like fire. As shrapnel explodes. I think it's wrong. To conscript our youth. Against their will. When plenty of our citizenry. Really like to kill. What sign posts will lead. To armageddon's fires. What bugles will call them. From crowded grey shires. The women sit quiet. With death on their minds. A slow dusk descending. The drawing of blinds. Make the hunters all line up. It's their idea of fun. And let those be forgiven. Who never owned a gun. Was it him or me. Or the wailing of the dead. The laughing soldiers. Cast their lots. And you can cut the dread.,ENGLISH
/10000-maniacs/,All That Never Happens,/10000-maniacs/all-that-never-happens.html,"She walks alone on the brick lane,. the breeze is blowing.. A year had changed her forever,. just like her grey home.. He used to live so close here,. we'd look for places I can't remember.. The world was safe when she knew him,. she tried to hold him, hold on forever.. For all that never happens and all that never will be,. a candle burning for the love we seldom keep.. The earth was raw in her fingers,. she overturned it.. Considered planting some flowers,. they wouldn't last long,. no one to tend them.. It's funny how these things go,. you were the answer to all the questions.. The memories made her weary,. she shuddered slowly,. she didn't want to.. As a distant summer he began to whisper,. and threw a smile her way.. She looked into the glass,. liquid surface showing that they were melding,. together present past.. So where can I go from here?. The color fading,. he didn't answer.. She felt him slip from her vision.. She tried to hold him, hold on forever.. So close forever,. in a silent frozen sleep..",ENGLISH
/10000-maniacs/,Back O'The Moon,/10000-maniacs/back-o-the-moon.html,Jenny. Jenny you don't know the nights I hide. below a second story room. to whistle you down. the man who's let to divvy up. time is a miser. he's got a silver coin. only lets it shine for hours. while you sleep it away. there's one rare and odd style of living. part only known to the everybody Jenny. a comical where's the end parade. of the sort people here would think unusual. Jenny. tonight upon the mock brine of a Luna Sea. far off we sail on to Back O'The Moon. Jenny. Jenny you don't know the days I've tried. telling backyard tales. so to maybe amuse. o your mood is never giddy. if you smile I'm delighted. but you'd rather pout. such a lazy child. you dare fold your arms. tisk and say that I lie. there's one rare and odd style of thinking. part only known to the everybody Jenny. the small step and giant leap takers. got the head start in the race toward it. Jenny. tonight upon the mock brine of a Luna Sea. far off we sail on to the Back O'The Moon. that was a sigh. but not meant to envy you. when your age was mine. some things were sworn true. morning would come. and calendar pages had. new printed seasons on. their opposite sides. Jenny. Jenny you don't know the nights I hide. below a second story room. to whistle you down. o the man who's let to divvy up. time is a miser. he's got a silver coin. lets it shine for hours. while you sleep it away. there's one rare and odd style of living. part only known to the everybody Jenny. out of tin ships jump the bubble head boys. to push their flags into powdered soils and cry. no second placers. no smart looking geese in bonnets. dance with pigs in high button trousers. no milk pail for the farmer's daughter. no merry towns of sweet walled houses. here I've found. Back O' the Moon. not here. I've found. Back O' the Moon,ENGLISH

Let's start by importing the necessary libraries and preparing the data. I recommend using Google Colab for this project, since access to a GPU will make everything much faster.

import pandas as pd
import numpy as np
import random
import os
import csv
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AdamW, get_linear_schedule_with_warmup
from tqdm import tqdm, trange
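
If you are on Colab, it is worth a quick sanity check that the GPU runtime is enabled (a small check I've added for convenience; it is not part of the original script):

# Should print True on a GPU runtime; if False, switch the Colab runtime type
print(torch.cuda.is_available())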

### Prepare data
lyrics = pd.read_csv('lyrics-data.csv')
lyrics = lyrics[lyrics['Idiom']=='ENGLISH']

#Only keep artists with genre Rock and a high enough popularity
artists = pd.read_csv('artists-data.csv')
artists = artists[(artists['Genre'].isin(['Rock'])) & (artists['Popularity']>5)]
df = lyrics.merge(artists[['Artist', 'Genre', 'Link']], left_on='ALink', right_on='Link', how='inner')
df = df.drop(columns=['ALink','SLink','Idiom','Link'])

#Drop songs whose lyrics are too long (GPT-2 cannot handle inputs beyond 1024 tokens)
df = df[df['Lyric'].apply(lambda x: len(x.split(' ')) < 350)]

#Create a very small test set to compare generated text with the reality
test_set = df.sample(n = 200)
df = df.loc[~df.index.isin(test_set.index)]

#Reset the indexes
test_set = test_set.reset_index()
df = df.reset_index()

#For the test set only, keep last 20 words in a new column, then remove them from original column
test_set['True_end_lyrics'] = test_set['Lyric'].str.split().str[-20:].apply(' '.join)
test_set['Lyric'] = test_set['Lyric'].str.split().str[:-20].apply(' '.join)

As the last few lines of the code above show, I created a small test set in which the last 20 words of every song are removed. This will let me compare the generated text with the real endings to see how well the model performs.

2. Creating the Dataset

To use GPT-2 on our data, a few more things are needed. We need to tokenize the data, which is the process of converting a sequence of characters into tokens, i.e. splitting a sentence into words.
We also need to make sure that every song respects the maximum of 1024 tokens.
The SongLyrics class below will do this for every song in our original dataframe, ready for training.
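
As a quick illustration of what tokenization means here (a snippet I've added; the example sentence is made up):

from transformers import GPT2Tokenizer

# Encode a sentence into GPT-2 token ids, then decode it back to text
tok = GPT2Tokenizer.from_pretrained("gpt2")
ids = tok.encode("More than this, you know there's nothing")
print(ids)              # a list of integer token ids
print(tok.decode(ids))  # round-trips back to the original sentence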

class SongLyrics(Dataset):
    def __init__(self, data, control_code="lyrics", truncate=False, gpt2_type="gpt2", max_length=1024):

        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_type)
        self.lyrics = []

        # Wrap each song with a control code and an end-of-text token,
        # truncating the raw text so we stay within GPT-2's context window
        for row in data:
            self.lyrics.append(torch.tensor(
                self.tokenizer.encode(f"<|{control_code}|>{row[:max_length]}<|endoftext|>")
            ))
        if truncate:
            self.lyrics = self.lyrics[:20000]
        self.lyrics_count = len(self.lyrics)

    def __len__(self):
        return self.lyrics_count

    def __getitem__(self, item):
        return self.lyrics[item]

dataset = SongLyrics(df['Lyric'], truncate=True, gpt2_type="gpt2")
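
A quick sanity check on the resulting dataset (an illustrative snippet I've added; it only inspects the first item):

# Inspect the first encoded song: dataset size, tensor shape, decoded prefix
print(len(dataset))
sample = dataset[0]
print(sample.shape)                                     # 1-D tensor of token ids
print(dataset.tokenizer.decode(sample.tolist())[:120])  # starts with the control code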

3. Training the Model

We can now import the pretrained GPT-2 model, as well as the tokenizer. Also, as I mentioned earlier, GPT-2 is huge. If you try to use it as-is on your machine, you will most likely run into a bunch of CUDA out-of-memory errors.
An alternative that works is gradient accumulation.
The idea is simple: before calling the optimizer to perform a gradient descent step, we sum the gradients of several operations. The total is then averaged over the number of accumulated steps, giving the mean loss over those training samples. This lets us train with an effective batch size far larger than what fits in GPU memory at once.
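
To make the accumulation idea concrete, here is a minimal, self-contained sketch on a toy model (everything in it, including accum_steps, is illustrative and separate from the actual training loop below):

import torch
from torch import nn

# Toy setup: a tiny linear model and random mini-batches, just to show the pattern
toy_model = nn.Linear(10, 1)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.01)
batches = [torch.randn(8, 10) for _ in range(8)]

accum_steps = 4
toy_optimizer.zero_grad()
for step, batch in enumerate(batches):
    loss = toy_model(batch).pow(2).mean()
    (loss / accum_steps).backward()    # gradients sum across backward() calls
    if (step + 1) % accum_steps == 0:
        toy_optimizer.step()           # one update every accum_steps mini-batches
        toy_optimizer.zero_grad()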

#Get the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

#Accumulated batch size (since GPT-2 is so big)
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    # Returns (packed_tensor, carry_on, remainder)
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None
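
To make the three return cases of pack_tensor concrete, here is a small illustrative run (the tensors are made up):

a = torch.ones(1, 5, dtype=torch.long)
b = torch.ones(1, 4, dtype=torch.long)

packed, carry_on, remainder = pack_tensor(a, None, max_seq_len=10)
print(packed.shape, carry_on, remainder)   # torch.Size([1, 5]) True None

packed, carry_on, remainder = pack_tensor(b, packed, max_seq_len=10)
print(packed.shape, carry_on)              # torch.Size([1, 8]) True (5 + 4 - 1, first packed token dropped)

packed, carry_on, remainder = pack_tensor(b, packed, max_seq_len=10)
print(carry_on, remainder.shape)           # False torch.Size([1, 4]) -- adding b would exceed the limit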

Now, finally, we can create the training function that fine-tunes GPT-2 on all of our lyrics, so that it can predict quality verses in the future.

def train(
    dataset, model, tokenizer,
    batch_size=16, epochs=5, lr=2e-5,
    max_seq_len=400, warmup_steps=200,
    gpt2_type="gpt2", output_dir=".", output_prefix="wreckgar",
    test_mode=False, save_model_on_epoch=False,
):
    acc_steps = 100
    device = torch.device("cuda")
    model = model.cuda()
    model.train()

    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=-1)

    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
    loss = 0
    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):

        print(f"Training epoch {epoch}")
        print(loss)
        for idx, entry in tqdm(enumerate(train_dataloader)):
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, 768)

            if carry_on and idx != len(train_dataloader) - 1:
                continue

            input_tensor = input_tensor.to(device)
            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            loss.backward()

            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                model.zero_grad()

            accumulating_batch_count += 1
            input_tensor = None
        if save_model_on_epoch:
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f"{output_prefix}-{epoch}.pt"),
            )
    return model

Feel free to play with the various hyperparameters (batch size, learning rate, epochs, optimizer).
Then, finally, we can train the model.

model = train(dataset, model, tokenizer)

With torch.save and torch.load, you can also save the trained model for future use.
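
For example (a minimal sketch; the file name gpt2_lyrics.pt is an arbitrary choice):

# Persist the fine-tuned weights, then restore them into a fresh GPT-2 instance
torch.save(model.state_dict(), "gpt2_lyrics.pt")

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.load_state_dict(torch.load("gpt2_lyrics.pt"))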

4. Generating Lyrics

It's time to use our freshly fine-tuned model to generate lyrics. With the two functions below, we can generate lyrics for every song in the test dataset. Remember that I removed the last 20 words of each song. Now, for a given song, our model will look at the lyrics it is given and come up with what the ending of the song should be.

def generate(
    model,
    tokenizer,
    prompt,
    entry_count=10,
    entry_length=30, #maximum number of words
    top_p=0.8,
    temperature=1.,
):
    model.eval()
    generated_num = 0
    generated_list = []

    filter_value = -float("Inf")

    with torch.no_grad():

        for entry_idx in range(entry_count):

            entry_finished = False
            generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)

            for i in range(entry_length):
                outputs = model(generated, labels=generated)
                loss, logits = outputs[:2]
                logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                # Nucleus (top-p) filtering: keep the smallest set of words
                # whose cumulative probability exceeds top_p
                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[
                    ..., :-1].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = filter_value

                next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
                generated = torch.cat((generated, next_token), dim=1)

                if next_token in tokenizer.encode("<|endoftext|>"):
                    entry_finished = True

                if entry_finished:
                    generated_num = generated_num + 1
                    output_list = list(generated.squeeze().numpy())
                    output_text = tokenizer.decode(output_list)
                    generated_list.append(output_text)
                    break

            if not entry_finished:
                output_list = list(generated.squeeze().numpy())
                output_text = f"{tokenizer.decode(output_list)}<|endoftext|>"
                generated_list.append(output_text)

    return generated_list

#Function to generate multiple sentences. Test data should be a dataframe
def text_generation(test_data):
    generated_lyrics = []
    for i in range(len(test_data)):
        x = generate(model.to('cpu'), tokenizer, test_data['Lyric'][i], entry_count=1)
        generated_lyrics.append(x)
    return generated_lyrics

#Run the functions to generate the lyrics
generated_lyrics = text_generation(test_set)

The generate function handles the generation for a single prompt, while text_generation actually runs it over the whole test dataframe.
The entry_length parameter sets the maximum length of a generation. I left it at 30 words, because punctuation matters: later I remove the last few words to make sure each generation finishes at the end of a sentence.
Two other hyperparameters are worth mentioning:
Temperature (the temperature parameter). It is used to scale the probabilities before a word is sampled. A high temperature therefore pushes the model toward more original predictions, while a smaller temperature keeps it from going off topic.
Top-p filtering (the top_p parameter). The model sorts the word probabilities in descending order, then keeps the smallest set of words whose probabilities add up to p and drops the rest. This means the model keeps only the most relevant word probabilities, but not only the single best one, since several different words can be appropriate given a sequence.
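
To make top-p filtering concrete, here is a tiny worked example on a made-up five-word distribution; it mirrors the shifting logic used in generate above:

import torch

probs = torch.tensor([0.45, 0.25, 0.15, 0.10, 0.05])   # sorted, descending
cumulative = torch.cumsum(probs, dim=-1)                # [0.45, 0.70, 0.85, 0.95, 1.00]

top_p = 0.8
remove = cumulative > top_p          # [False, False, True, True, True]
remove[1:] = remove[:-1].clone()     # shift right: keep the word that crosses top_p
remove[0] = False                    # always keep the most likely word
filtered = probs.masked_fill(remove, 0.0)
print(filtered / filtered.sum())     # sampling pool: only the three most likely words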
In the code below, I simply clean the generated text, make sure it ends at the end of a sentence (not in the middle of one), and store it in a new column of the test dataset.

#Loop to keep only the generated text and add it as a new column in the dataframe
my_generations = []

for i in range(len(generated_lyrics)):
    a = test_set['Lyric'][i].split()[-30:] #Last 30 words of the prompt, used as the matching string
    b = ' '.join(a)
    c = ' '.join(generated_lyrics[i]) #Everything that comes after the matching string
    my_generations.append(c.split(b)[-1])

test_set['Generated_lyrics'] = my_generations


#Finish the sentences at the last period, and remove everything after it
final = []

for i in range(len(test_set)):
    to_remove = test_set['Generated_lyrics'][i].split('.')[-1]
    final.append(test_set['Generated_lyrics'][i].replace(to_remove, ''))

test_set['Generated_lyrics'] = final

5. Evaluating the Results

There are many ways to evaluate the quality of generated text. The most popular metric is called BLEU. The algorithm outputs a score between 0 and 1, depending on how similar the generated text is to the real one. A score of 1 means every generated word is present in the real text.
Here is the code that evaluates the BLEU score of the generated lyrics.

#Using the BLEU score to compare the real sentences with the generated ones
import statistics
from nltk.translate.bleu_score import sentence_bleu

scores = []

for i in range(len(test_set)):
    reference = test_set['True_end_lyrics'][i]
    candidate = test_set['Generated_lyrics'][i]
    scores.append(sentence_bleu(reference, candidate))

statistics.mean(scores)

We obtained an average BLEU score of 0.685, which is quite good. By comparison, the GPT-2 model without any fine-tuning gets a BLEU score of 0.288.
However, BLEU has its limitations. It was originally created for machine translation, and it only looks at the vocabulary used to judge the quality of generated text. That is a problem for us: it is entirely possible to generate high-quality verses that use completely different words from the real ones.
That is why I also did a subjective evaluation of the model's performance. To do so, I built a small web interface (using Dash). The code is available in my GitHub repository.
The interface works like this: you feed the app a few input words, and the model then uses them to predict what the next verses should be. Below are some example results.

[Figure 1: example generated lyrics; input sequence in black, GPT-2 continuation in red]

Given the input sequence in black, the red text is what the GPT-2 model predicted. You can see that it manages to generate meaningful verses that respect the preceding context! Moreover, it produces sentences of similar length, which matters a lot for keeping the rhythm of the song. In that respect, punctuation in the input text is absolutely essential when generating lyrics.

6. Conclusion

As this article has shown, by fine-tuning GPT-2 on specific data, it is fairly easy to generate contextually relevant text.
For lyrics generation, the model can produce lyrics that fit both the context and the desired sentence length. The model could of course be improved: for example, we could force it to generate rhyming verses, which is often needed when writing song lyrics.
Thanks a lot for reading, and I hope this helps!
The repository with all the code and the model can be found here: https://github.com/francoisstamant/lyrics-generation-with-GPT2

Original article: https://towardsdatascience.com/how-to-fine-tune-gpt-2-for-text-generation-ae2ea53bc272
