serhii.net

In the middle of the desert you can say anything you want

24 Oct 2023

Master thesis task CBT

Basics

TODOs

  • deduplicate options etc. by lemma (синку-син-??)
  • gender of the noun giving hints!
  • the bits below

Issues/Problems/TODOs

Multiple possible answers

— Синку, як ти мене знайшов? — запитав батько. — Коли вже так, віднеси обід до джерела, я туди прийду і поїмо
QUESTION:	— Ні, батьку, — сказав ______ .
OPTIONS:	{'хлопець', 'хлопчик', 'син', 'цар'}

Complex structures

 Будь ______ , пообідайте з нами!', options={'ласка', 'ножа', 'жаль', 'візир', 'дозволь'}, answer='ласка')

Unknown/unknowable answer

│ context = 'Ein Mann und eine Frau hatten einen goldenen Ring. Das war ein     │
│           Glücksring, und wer ihn besaß, hatte immer genug zu leben. Sie      │
│           wußten es aber nicht und verkauften den Ring für wenig Geld. Kaum   │
│           war der Ring aus dem Hause, da wurden sie immer ärmer und wußten    │
│           schließlich nicht mehr, woher sie genug zum Essen nehmen sollten.   │
│           Sie hatten auch einen Hund und eine Katze, die mußten mit ihnen     │
│           Hunger leiden. Da ratschlagten die Tiere miteinander, wie sie den   │
│           Leuten wieder zu ihrem alten Glück verhelfen könnten.'              │
I’ll be using “Label all tasks”; it then shows me the next CBT after I submit.

Keybindings are nice for classifying text.

When importing the tasks, I should try text highlighting or similar to make them easier to parse visually.

Code notes

Multiple hard options

Sometimes pymorphy returns multiple parses for one word:

[
    Parse(
        word='корів',
        tag=OpencorporaTag('NOUN,inan plur,gent'),
        normal_form='кір',
        score=1.0,
        methods_stack=((DictionaryAnalyzer(), 'корів', 498, 11),)
    ),
    Parse(
        word='корів',
        tag=OpencorporaTag('NOUN,anim plur,gent'),
        normal_form='корова',
        score=1.0,
        methods_stack=((DictionaryAnalyzer(), 'корів', 2063, 8),)
    ),
    Parse(
        word='корів',
        tag=OpencorporaTag('NOUN,anim plur,accs'),
        normal_form='корова',
        score=1.0,
        methods_stack=((DictionaryAnalyzer(), 'корів', 2063, 10),)
    )
]

I can find the right one:
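One way to find it (my own sketch, not the thesis code): score each pymorphy parse by grammeme overlap with spacy’s UD morphology and take the best one. The UD→OpenCorpora mapping below is a small hand-made excerpt, not the full table.

```python
# Hand-made excerpt of a UD -> OpenCorpora grammeme mapping (assumption: mine).
UD_TO_OC = {
    ("Case", "Gen"): "gent", ("Case", "Nom"): "nomn", ("Case", "Acc"): "accs",
    ("Number", "Plur"): "plur", ("Number", "Sing"): "sing",
    ("Animacy", "Anim"): "anim", ("Animacy", "Inan"): "inan",
}

def pick_parse(spacy_morph, parses):
    """parses: list of (normal_form, grammeme_set); return the parse whose
    grammemes overlap most with what spacy's morphology asks for."""
    wanted = {UD_TO_OC[f] for f in spacy_morph.items() if f in UD_TO_OC}
    return max(parses, key=lambda p: len(wanted & p[1]))

# The three корів parses from above, flattened into (lemma, grammemes) pairs:
parses = [
    ("кір",    {"NOUN", "inan", "plur", "gent"}),
    ("корова", {"NOUN", "anim", "plur", "gent"}),
    ("корова", {"NOUN", "anim", "plur", "accs"}),
]
best = pick_parse({"Case": "Gen", "Number": "Plur", "Animacy": "Anim"}, parses)
```

With spacy reporting Animacy=Anim|Case=Gen|Number=Plur for корів, this picks the корова reading instead of кір.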

2023-11-29 11:46

2-3-4 and multiple plurals

(Pdb++) t.tag.numeral_agreement_grammemes(1)
{'sing', 'nomn'}
(Pdb++) t.tag.numeral_agreement_grammemes(2)
{'sing', 'gent'}
(Pdb++) t.tag.numeral_agreement_grammemes(3)
{'sing', 'gent'}
(Pdb++) t.tag.numeral_agreement_grammemes(4)
{'sing', 'gent'}
(Pdb++) t.tag.numeral_agreement_grammemes(5)
{'plur', 'gent'}
(Pdb++) t.tag.numeral_agreement_grammemes(6)
{'plur', 'gent'}
(Pdb++) self.morph.parse("стіл")[1].inflect({'plur'}).tag.number
'plur'
(Pdb++) self.morph.parse("стіл")[1].tag.number
(Pdb++)

Yes:

(Pdb++) pp self.morph.parse("столи")[1].lexeme
[Parse(word='стіл', tag=OpencorporaTag('NOUN,inan masc,nomn'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'стіл', 2710, 0),)),
 Parse(word='стола', tag=OpencorporaTag('NOUN,inan masc,gent'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'стола', 2710, 1),)),
 Parse(word='столу', tag=OpencorporaTag('NOUN,inan masc,gent'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столу', 2710, 2),)),
 Parse(word='столові', tag=OpencorporaTag('NOUN,inan masc,datv'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столові', 2710, 3),)),
 Parse(word='столу', tag=OpencorporaTag('NOUN,inan masc,datv'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столу', 2710, 4),)),
 Parse(word='стіл', tag=OpencorporaTag('NOUN,inan masc,accs'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'стіл', 2710, 5),)),
 Parse(word='стола', tag=OpencorporaTag('NOUN,inan masc,accs'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'стола', 2710, 6),)),
 Parse(word='столом', tag=OpencorporaTag('NOUN,inan masc,ablt'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столом', 2710, 7),)),
 Parse(word='столі', tag=OpencorporaTag('NOUN,inan masc,loct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столі', 2710, 8),)),
 Parse(word='столові', tag=OpencorporaTag('NOUN,inan masc,loct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столові', 2710, 9),)),
 Parse(word='столу', tag=OpencorporaTag('NOUN,inan masc,loct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столу', 2710, 10),)),
 Parse(word='столе', tag=OpencorporaTag('NOUN,inan masc,voct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столе', 2710, 11),)),
 Parse(word='столи', tag=OpencorporaTag('NOUN,inan plur,nomn'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столи', 2710, 12),)),
 Parse(word='столів', tag=OpencorporaTag('NOUN,inan plur,gent'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столів', 2710, 13),)),
 Parse(word='столам', tag=OpencorporaTag('NOUN,inan plur,datv'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столам', 2710, 14),)),
 Parse(word='столи', tag=OpencorporaTag('NOUN,inan plur,accs'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столи', 2710, 15),)),
 Parse(word='столами', tag=OpencorporaTag('NOUN,inan plur,ablt'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столами', 2710, 16),)),
 Parse(word='столах', tag=OpencorporaTag('NOUN,inan plur,loct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столах', 2710, 17),)),
 Parse(word='столи', tag=OpencorporaTag('NOUN,inan plur,voct'), normal_form='стіл', score=1.0, methods_stack=((DictionaryAnalyzer(), 'столи', 2710, 18),))]
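The 1 / 2-4 / 5+ rule that `numeral_agreement_grammemes` applies above can be written down explicitly. A pure-Python sketch of my understanding (mirroring the pdb output, not pymorphy’s actual source); the chosen grammemes would then go into `parse(...).inflect(...)` as in the session above:

```python
def numeral_agreement(n: int) -> set:
    """Grammemes a noun should take after the numeral n (sketch of the rule
    visible in the pdb session, not pymorphy's implementation)."""
    n = abs(n)
    if n % 10 == 1 and n % 100 != 11:
        return {"sing", "nomn"}   # 1, 21, 31, ... but not 11
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return {"sing", "gent"}   # 2-4, 22-24, ... but not 12-14
    return {"plur", "gent"}       # 5-20, 25-30, ...
```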

ChatGPT

…may help.

Format

The generated LMentry dataset is a nice example of the format: lmentry/data/all_words_from_category.json at main · aviaefrat/lmentry. Not all of it needs code and regexes! See lmentry/data/bigger_number.json at main · aviaefrat/lmentry.

More thoughts

Using GPT for filtering

  • I can ask gpt4 if it gets it; if it does, assume the individual instance is good enough

    • Maybe just parse the entire generated dataset through the API for that!
      • Approach: generate a lot of garbage, filter through gpt4 for the less garbage bits
  • 3.5-turbo can’t do this! 2023-12-11-212155_1149x587_scrot.png

  • but gpt4 can!
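The filtering loop could look like this. `ask` is a stand-in for a real gpt-4 API call (hypothetical, mine); it’s injected as a parameter so the keep/discard logic itself can be tested offline with a stub.

```python
def keep_instance(task: dict, ask) -> bool:
    """Keep a generated CBT instance only if the strong model answers it
    correctly. `ask(prompt) -> str` is an injected stand-in for an actual
    gpt-4 API call (assumption: this signature is mine, not any real SDK)."""
    prompt = (f"{task['context']}\n{task['question']}\n"
              f"Options: {', '.join(task['options'])}\n"
              "Reply with exactly one option.")
    return ask(prompt).strip() == task["answer"]

# Stubbed usage: a "model" that always answers 'син'.
task = {"context": "...", "question": "— Ні, батьку, — сказав ______ .",
        "options": ["хлопець", "син", "цар"], "answer": "син"}
kept = keep_instance(task, lambda prompt: "син")
```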

playing more with the code

spacy’s largest model is not perfect either:

(Pdb++) doc[10:17]
жило сто корів і тридцять кіз,
(Pdb++) doc[15]
кіз
(Pdb++) doc[15].norm_
'кіз'
(Pdb++) doc[15].morph
Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur

Вік живи, вік учись… (“live and learn”) https://goroh.pp.ua/%D0%A2%D0%BB%D1%83%D0%BC%D0%B0%D1%87%D0%B5%D0%BD%D0%BD%D1%8F/%D0%BA%D1%96%D0%B7

Oh wait, there’s also a transformer model (python3 -m spacy download uk_core_news_trf): https://spacy.io/models/uk#uk_core_news_trf, based on ukr-models/xlm-roberta-base-uk · Hugging Face. It’s 322 MB and doesn’t get it either.

Solutions

  • I can literally somehow hardcode that all animals are animate? Or avoid using words like кіз?
  • For pronouns and stuff I can do manual dictionary stuff, like її -> його etc.
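For the pronoun dictionary, even a tiny hand-made table goes a long way: gender-swapped forms make natural distractors without any morphology code. The mapping below is mine and deliberately not exhaustive.

```python
# Tiny hand-made pronoun table (assumption: mine, not exhaustive).
PRONOUN_SWAPS = {"її": "його", "його": "її",
                 "вона": "він", "він": "вона",
                 "нею": "ним", "ним": "нею"}

def pronoun_distractor(word: str):
    """Return the opposite-gender pronoun, or None if not in the table."""
    return PRONOUN_SWAPS.get(word.lower())
```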

// aside - this feels so satisfying to look at! 2023-12-11-212829_605x452_scrot.png

More generating stories with ChatGPT

GPT4 and graph structures

GPT4 is really good at writing stories based on graph-like descriptions. These stories don’t use chunks found on Google, and they aren’t continued the same way by ChatGPT itself when I give it the first couple of sentences.

Even more complex ones with multiple characters: https://chat.openai.com/share/d42debd7-d80a-4030-ac7c-a18ecb1416a9

This is based on graph descriptions of stories generated by ChatGPT itself with the already mentioned prompt of ‘do Aesop but in graph form and not about grapes’.

231214-1503 Asking ChatGPT to write its own prompts is definitely a superpower:

“Generate an abstract graph structure for a narrative involving multiple animate characters. The graph should include nodes for entities, goals, challenges, interactions, outcomes, and moral lessons. Each node should abstractly represent the core elements of the story, focusing on thematic and moral aspects rather than the literal narrative. The format should be similar to a semantic web ontology, emphasizing relationships and abstract concepts. Please provide the graph in a Python dictionary format, with complexity and depth akin to an advanced semantic network.”

These will help me:

I think I have gold. I can either make it describe ATU stories as generic graphs and then generate stories matching these graphs, or make it generate graphs of existing stories with replacements and then generate stories based on these changed (but still concrete) graphs.

Bonus points: find some parseable ATU index online and pick the ATU indexes that have no Ukrainian or Russian folktales associated with them.

 > ag ath | sed "s/.*: \(.*\)/\\1/g" > motivs_present

gave me the list of motifs with Slavic tales.

I think this is the best prompt (https://chat.openai.com/share/32f40f4e-3973-42cf-8621-2eb8ebb94e76):

Generate an abstract graph structure for a narrative involving multiple animate characters. The graph should include nodes for entities, goals, challenges, interactions, outcomes, and moral lessons. Each node should abstractly represent the core elements of the story, focusing on thematic and moral aspects rather than the literal narrative. The format should be similar to a semantic web ontology, emphasizing relationships and abstract concepts. Please provide the graph in a Python dictionary format, with complexity and depth akin to an advanced semantic network. It should be less than 15 lines long.

The story is ATU motif 1: “the theft of fish”

Trivially editable to be about not-fish.

narrative_graph = {
    "Entities": {
        "Thief": {"Characteristics": ["Cunning", "Resourceful"], "Role": "Protagonist"},
        "Fish": {"Characteristics": ["Valuable", "Symbolic"], "Role": "Object"},
        "Owner": {"Characteristics": ["Victimized", "Unaware"], "Role": "Antagonist"}
    },
    "Goals": {
        "Thief": "Steal Fish",
        "Owner": "Protect Property"
    },
    "Challenges": {
        "Thief": "Avoid Detection",
        "Owner": "Secure Property"
    },
    "Interactions": {
        ("Thief", "Fish"): "Theft",
        ("Thief", "Owner"): "Avoidance",
        ("Owner", "Fish"): "Ownership"
    },
    "Outcomes": {
        "Immediate": "Successful Theft",
        "Long-term": "Loss of Trust"
    },
    "Moral Lessons": {
        "Actions Have Consequences",
        "Importance of Trust",
        "Greed Leads to Loss"
    }
}

Yes, this is the solution. Woohoo! I’ll prolly manually pick some good ATUs and craft the different JSON descriptions from which multiple stories will be generated.

Also, gpt3.5 creates graphs faster than gpt4, with very comparable quality. This is 3.5:

{
    "Entities": [
        {
            "Name": "Thief",
            "Type": "Character"
        },
        {
            "Name": "Fisherman",
            "Type": "Character"
        },
        {
            "Name": "Fish",
            "Type": "Object"
        }
    ],
    "Goals": [
        {
            "Entity": "Thief",
            "Goal": "Steal Fish"
        },
        {
            "Entity": "Fisherman",
            "Goal": "Protect Fish"
        }
    ],
    "Challenges": [
        {
            "Challenge": "Conceal Theft",
            "Entity": "Thief"
        },
        {
            "Challenge": "Catch Thief",
            "Entity": "Fisherman"
        }
    ],
    "Interactions": [
        {
            "Interaction": "Theft",
            "Entities": ["Thief", "Fisherman"]
        }
    ],
    "Outcomes": [
        {
            "Outcome": "Thief Caught",
            "Entities": ["Thief"]
        },
        {
            "Outcome": "Fish Returned",
            "Entities": ["Fisherman"]
        }
    ],
    "MoralLessons": [
        {
            "Lesson": "Stealing leads to consequences",
            "Entities": ["Thief"]
        },
        {
            "Lesson": "Protecting what's yours is important",
            "Entities": ["Fisherman"]
        }
    ]
}

2023-12-15 15:15 231010-1003 Masterarbeit Tagebuch

[[231010-1003 Masterarbeit Tagebuch#231024-1704 Master thesis task CBT]] Results of discussion:

  • The general idea of using label-studio to filter out bad instances is solid.
    • → What are the chances that a LM gets the answer right randomly?
  • Some examples are bad not because they’re impossible, but because no context is needed to narrow down the solutions.
    • E.g. in the example below it’s clearly ‘Vater’ or ‘Mutter’ regardless of what the context says: 2023-12-15-151150_950x287_scrot.png
    • … leading to a probability not of 1/4(..10) but 1/2.
    • One way to filter out such bad examples is to get a LM to solve the task without providing the context, or even better, to look at the distribution of probabilities over the answers and see if some are MUCH more likely than the others.
  • Issue with 2-3-4 plurals: I can just create three classes of nouns: singular, 2-3-4, and >=5.
  • Don’t forget to discuss the morphology complexities in the Masterarbeit.
  • Conveying the issues in English is hard, but I can (for a given UA example):
    • provide the morphology info for the English words
    • provide a third German translation

  • What if I use the same approach I did in LMentry: capitalizing the words I feel are worth replacing and then doing regex/templating magic?

    • I’d use the same story, highlight the needed words by double-clicking, and get around ~10 replacements from each story; and I’d have to do 100 of them for it to start making sense…
  • I should refactor the code so that, given a specific word and options, it replaces the word with a blank and puts the options into the correct morphology; that way I’ll be able to use it as a more generic thing.
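A sketch of what that refactored function could look like. `inflect` stands in for the pymorphy-based morphology code and is injected as a parameter (the name and signature are my assumptions, not the existing code):

```python
def make_gap_task(text: str, target: str, option_lemmas: list, inflect) -> dict:
    """Replace `target` with a blank and put the option lemmas into the
    target's morphological form. `inflect(lemma, target) -> str` is an
    injected stand-in for the real pymorphy-based code (assumption: mine)."""
    blanked = text.replace(target, "______", 1)
    options = {inflect(lemma, target) for lemma in option_lemmas} | {target}
    return {"question": blanked, "options": options, "answer": target}

# Stubbed usage: an "inflector" that returns the lemma unchanged.
t = make_gap_task("дала синові в руки та й відправила", "руки",
                  ["рука", "цар"], lambda lemma, target: lemma)
```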

Other options for annotation

Alternatives

Similar tasks:

  • NarrativeQA!
  • Story Cloze Test
  • CBeebes
  • bAbI?..

In a certain sense, The Winograd Schema Challenge[1] is trying to do basically the same thing as I am and describes many of the same pitfalls. WinoGrande[2] is the same but larger, formulated as a fill-in-the-blanks task; its paper contains info about how they used models like RoBERTa to benchmark on it: input formats and all that.

Performance of existing models

"Одного разу селянин пішов у поле орати. Дружина зібрала йому обід. У селянина був семирічний син. Каже він матері: — Мамо, дай-но я віднесу обід батькові. — Синку, ти ще малий, не знайдеш батька, — відповіла мати. — Не бійтеся, матінко. Дорогу я знаю, обід віднесу. Мати врешті погодилась, зав’язала хліб у вузлик, приладнала йому на спину, вариво налила у миску, дала синові в ______ та й відправила у поле. Малий не заблукав, доніс обід батькові. — Синку, як ти мене знайшов? — запитав батько. — Коли вже так, віднеси обід до джерела, я туди прийду і поїмо. — Ні, батьку, — сказав син."  
Замість _______ має бути:  
"цар",  
"рибки",  
"хлопця",  
"сина",  
"руки"  
?

On Perplexity Labs:

  • mixtral8x7b-instruct is correct and logical
  • codellama34b-instruct is correct
  • pplx70b-chat fails
  • llama70b-chat fails

New idea

2024-01-11 12:58

  • Find false friends by comparing word positions in RU and UA embeddings!
  • Interference can be measured as higher sensitivity to these false friends in the Ukrainian of native RU speakers vs. speakers of other L1s
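A toy sketch of the embedding idea: a word spelled identically in both languages but sitting far apart in aligned RU/UA embedding spaces is a false-friend candidate (луна is ‘moon’ in RU but ‘echo’ in UA). The 2-d vectors below are entirely made up for illustration; real use would need actual aligned embeddings.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def false_friend_candidates(shared_words, ru_vecs, ua_vecs, threshold=0.5):
    """Identically-spelled words whose aligned RU/UA vectors diverge."""
    return [w for w in shared_words
            if cosine(ru_vecs[w], ua_vecs[w]) < threshold]

# Made-up 2-d "embeddings" purely for illustration (assumption: mine).
ru = {"луна": [1.0, 0.0], "рука": [0.0, 1.0]}   # RU луна = 'moon'
ua = {"луна": [0.0, 1.0], "рука": [0.1, 1.0]}   # UA луна = 'echo'
```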

Back to pymorphy morphology

2024-02-07 20:55

Hopefully last problem of this type.

(Pdb++) x
Грізний
(Pdb++) x.morph
Animacy=Anim|Case=Gen|Gender=Masc|NameType=Sur|Number=Sing
(Pdb++) print(self.disamb.pymorphy_analyzer.parse(x.text))
[
    Parse(
        word='грізний',
        tag=OpencorporaTag('ADJF,compb masc,nomn'),
        normal_form='грізний',
        score=1.0,
        methods_stack=((DictionaryAnalyzer(), 'грізний', 76, 0),)
    ),
    Parse(
        word='грізний',
        tag=OpencorporaTag('ADJF,compb masc,accs'),
        normal_form='грізний',
        score=1.0,
        methods_stack=((DictionaryAnalyzer(), 'грізний', 76, 4),)
    ),
    Parse(
        word='грізний',
        tag=OpencorporaTag('ADJF,compb masc,voct'),
        normal_form='грізний',
        score=1.0,
        methods_stack=((DictionaryAnalyzer(), 'грізний', 76, 8),)
    )
]

This happens for only some of them e.g. Швидкий is sometimes OK:

(Pdb++) x
Швидкий
(Pdb++) x.morph
Animacy=Anim|Case=Nom|Gender=Masc|NameType=Sur|Number=Sing
(Pdb++) print(self.disamb.pymorphy_analyzer.parse(x.text))
[
    Parse(
        word='швидкий',
        tag=OpencorporaTag('ADJF,compb masc,nomn'),
        normal_form='швидкий',
        score=1.0,
        methods_stack=((DictionaryAnalyzer(), 'швидкий', 76, 0),)
    ),
    Parse(
        word='швидкий',
        tag=OpencorporaTag('ADJF,compb masc,accs'),
        normal_form='швидкий',
        score=1.0,
        methods_stack=((DictionaryAnalyzer(), 'швидкий', 76, 4),)
    ),
    Parse(
        word='швидкий',
        tag=OpencorporaTag('ADJF,compb masc,voct'),
        normal_form='швидкий',
        score=1.0,
        methods_stack=((DictionaryAnalyzer(), 'швидкий', 76, 8),)
    ),
    Parse(
        word='швидкий',
        tag=OpencorporaTag('NOUN,anim masc,nomn'),
        normal_form='швидкий',
        score=1.0,
        methods_stack=((DictionaryAnalyzer(), 'швидкий', 95, 0),)
    ),
    Parse(
        word='швидкий',
        tag=OpencorporaTag('NOUN,anim masc,voct'),
        normal_form='швидкий',
        score=1.0,
        methods_stack=((DictionaryAnalyzer(), 'швидкий', 95, 7),)
    )
]

(and then my get_with_disambiguation works)

What can I do?

  • Throw stories away that use such names (and update the prompts)
  • Find a way to cast it into a noun
    • Do some hack based on this
    • They are nouns-formed-from-adjectives, ergo they’ll prolly inflect like the adjectives themselves
    • => Then I inflect adjectives!

I’ll try the last one.
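A minimal sketch of that last idea, with a hand-written hard-stem masculine endings table (an assumption of mine covering only names like Грізний/Швидкий; the real code would ask pymorphy to inflect the adjective reading, and soft-stem adjectives decline differently):

```python
# Hard-stem masculine adjectival endings, hand-written (assumption: mine).
ADJ_ENDINGS = {"nomn": "ий", "gent": "ого", "datv": "ому",
               "accs": "ого",   # animate accusative = genitive
               "ablt": "им", "loct": "ому", "voct": "ий"}

def inflect_adj_surname(nomn_form: str, case: str) -> str:
    """Inflect an adjective-like name (Грізний, Швидкий) by swapping the
    nominative ending -ий for the target case's ending."""
    stem = nomn_form[:-2]          # strip the -ий of the nominative
    return stem + ADJ_ENDINGS[case]
```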

['кравчиня', 'грізний', 'звір', 'швидкий', 'лев', 'грізного']

Only one of these is, strictly speaking, a feminine noun; bad example.

['кравчиня']

BUT I don’t want to inflect them by gender, because the lion Грізний != Грізна.

ALSO lemmas are a bundle of joy I forgot about.

(Pdb++) x1,x1.lemma_,x1.morph
(Повільна, 'повільний', Case=Nom|Degree=Pos|Gender=Fem|Number=Sing)
(Pdb++) x2,x2.lemma_,x2.morph
(Грізного, 'грізного', Animacy=Anim|Case=Gen|Gender=Masc|NameType=Sur|Number=Sing)
  • What is degree for a noun?
  • Degree
    • Pos is first degree, like young man.
    • ChatGPT says that it’s spacy recognizing the adjectival source of the noun and giving info about the original adjective.
      • “Common in morphologically rich languages”
      • Alright
(Pdb++) [v for k,v in lemmas.items()]
[жителі, власником, заєць, Швидкий, кравчиня, працівницею, левів, Грізний, Грізному, Грізного, звірів, конкуренти, лисиця, Хитра, вовк, Звірі, пліткам, Злий, конкурентки]
[['Masc'], ['Masc'], ['Masc'], ['Masc'], ['Fem'], ['Fem'], ['Masc'], ['Masc'], ['Masc'], ['Masc'], ['Masc'], ['Masc'], ['Fem'], ['Fem'], ['Masc'], [], ['Masc'], ['Masc'], ['Fem']]

Looking even deeper: spacy doesn’t handle the more frequent names any better. And counting the number of capitalized occurrences to build a dictionary etc. is not worth the effort. Giving up.

Also, how interesting:

(Pdb++) x
Шакал
(Pdb++) x.morph
Animacy=Anim|Case=Gen|Gender=Fem|NameType=Sur|Number=Sing|Uninflect=Yes
# lowercase version is interesting as well, inan ins?
Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing

Also, looking at “Лео” I realize that pymorphy3 is really much better: it sees it as a NOUN/Name, as opposed to UNKN. Is it too late?

Switched to pymorphy3; except for two weird words (incl. Жаба) where there’s no case, I can see no difference. Left it so.

Hopefully last challenges

Дієприслівники (adverbial participles): GRND

What spacy tags as VERB, pymorphy sees as GRND:

> cand
відчувши
> cand.morph
Aspect=Perf|Tense=Past|VerbForm=Conv
> cand.pos_
'VERB'

> self.disamb.get_with_disambiguation(cand).tag
OpencorporaTag('GRND,perf')

  1. Hector Levesque, Ernest Davis, Leora Morgenstern, “The Winograd Schema Challenge” (2012). ↩︎

  2. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi, “WinoGrande: An Adversarial Winograd Schema Challenge at Scale” (2019), http://arxiv.org/abs/1907.10641. ↩︎
