LMdiff: A Visual Diff Tool to Compare Language Models

code demo

Comment: Would be interesting to use the tool to drill into language model memorization.

Notes: Visualization that compares the internal states of two language models, showing how their inference results and output distributions differ.

LightTag: Text Annotation Platform

website

Comment: Don’t see a significant advantage over Prodigy, Doccano, or Label Studio. Doesn’t have active-learning hooks either.

Notes: Another text annotation tool. Has an entity-relationship diagram, infinite scrolling, and batch actions.

Thermostat: A Large Collection of NLP Model Explanations and Analysis Tools

paper presentation demo

Comment: Interesting tool for pretrained model EDA.

Notes:
feature attribution methods for explainability

  • gradient x input / gradient saliency
  • integrated gradients
  • LIME
  • Occlusion (sliding window occlusions)
  • Shapley permutations
Each of these generates a feature attribution map.

Hugging Face (models and datasets) x Facebook’s Captum.
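A minimal sketch of what one of these attribution methods looks like in code, using Captum’s layer integrated gradients on an example Hub checkpoint (the model choice and forward wrapper are illustrative assumptions, not Thermostat’s own pipeline):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from captum.attr import LayerIntegratedGradients

# Example checkpoint; Thermostat ships precomputed attribution maps for many models.
tok = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-imdb")
model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-imdb")
model.eval()

def forward_fn(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

enc = tok("a surprisingly touching film", return_tensors="pt")
lig = LayerIntegratedGradients(forward_fn, model.bert.embeddings)

# Attribute the predicted class back to the input embeddings.
target = forward_fn(enc.input_ids, enc.attention_mask).argmax(dim=-1)
attrs = lig.attribute(inputs=enc.input_ids,
                      additional_forward_args=(enc.attention_mask,),
                      target=target)
scores = attrs.sum(dim=-1).squeeze(0)  # one attribution score per input token
print(list(zip(tok.convert_ids_to_tokens(enc.input_ids[0].tolist()), scores.tolist())))
```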

Benchmarking Meta-embeddings: What Works and What Does Not

presentation code

Comment: ??? What’s this…

Set Generation Networks for End-to-End Knowledge Base Population

paper

Comment: another transformer-type model for an end-to-end task, but no code release, so no reproducibility.

Notes:
KB population: NER -> entity linking -> relation extraction
prone to error propagation
some prior work tried seq2seq models for e2e KBP
this paper frames KBP as set generation: transformer encoder, non-autoregressive decoder, and a bipartite matching loss (sketch below)
works on Wiki and Geo datasets
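The bipartite matching loss is the interesting part: predictions are matched one-to-one to gold triples with the Hungarian algorithm before the loss is computed, so supervision is order-invariant. A toy sketch of the matching step (the cost matrix here is a stand-in, not the paper’s exact cost):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j] = cost of pairing predicted triple i with gold triple j;
# in the paper this would come from class probabilities and span scores.
cost = np.array([
    [0.2, 0.9, 0.8],
    [0.7, 0.1, 0.6],
    [0.8, 0.7, 0.3],
])

pred_idx, gold_idx = linear_sum_assignment(cost)  # Hungarian matching
loss = cost[pred_idx, gold_idx].sum()             # loss over matched pairs only
print(list(zip(pred_idx.tolist(), gold_idx.tolist())))  # [(0, 0), (1, 1), (2, 2)]
```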

Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing

presentation

Comment: No code.

Notes:
introduces the Distiller framework for knowledge distillation.
intermediate distillation matters most; it maximizes a lower bound on mutual information (sketch below)
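A toy sketch of what intermediate-layer distillation usually looks like in practice (MSE between projected student and teacher hidden states plus a soft-label term; the projection and loss weighting are illustrative assumptions, not Distiller’s exact objective):

```python
import torch
import torch.nn.functional as F

hidden_teacher = torch.randn(8, 128, 768)  # (batch, seq, teacher hidden dim)
hidden_student = torch.randn(8, 128, 384)  # (batch, seq, student hidden dim)

# Project student states to teacher width so the two are comparable.
proj = torch.nn.Linear(384, 768)
intermediate_loss = F.mse_loss(proj(hidden_student), hidden_teacher)

# Usual soft-label KD term on the output logits, with temperature T.
T = 2.0
logits_t, logits_s = torch.randn(8, 2), torch.randn(8, 2)
kd_loss = F.kl_div(F.log_softmax(logits_s / T, dim=-1),
                   F.softmax(logits_t / T, dim=-1),
                   reduction="batchmean") * T * T

loss = kd_loss + intermediate_loss
```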

Adversarial Attacks: small input perturbations that lead to misclassification by the model

paper presentation code

Comment: Useful for word-level adversarial attacks or data augmentation.

Notes:

sentence level
word level
character level

adversarial attacks on NER currently exist only at the character level

RQ1

attack strategies:
char level: DeepWordBug-I and DeepWordBug-II
word level: BERT-Attack (BERT), CLARE (RoBERTa); named entities are left unchanged
sentence level: SCPN, which generates paraphrases from predefined syntactic parses

char level works best, but fails when perturbing named entities is prohibited, and such attacks are unlikely in the real world; a Levenshtein distance limit per word is used
word level works great
sentence level doesn’t work
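A toy illustration of a character-level perturbation in the DeepWordBug family (a single adjacent-character swap; the real attacks first rank tokens by importance and enforce an edit-distance budget):

```python
import random

def char_swap(word: str) -> str:
    """Swap two adjacent characters, leaving the first character intact."""
    if len(word) < 3:
        return word
    i = random.randrange(1, len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

random.seed(0)
print(char_swap("misclassification"))  # e.g. "misclassifciation"
```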

RQ2
DeepWordBug-I: using 500 samples for adversarial training / data augmentation resulted in improved robustness
BERT-Attack: needs 1000 samples to work; improved accuracy

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

paper presentation

Comment: Model based evaluation metrics using CLIP.

Notes:
Past: compare the candidate to references with text-text similarity, e.g., BLEU, CIDEr, METEOR, SPICE, BERTScore. But these are unreliable, and references are expensive to collect.

Proposal: use CLIP. Run the caption through the text encoder and the image through the image encoder, then compute their cosine similarity; higher is better.

RefCLIPScore: additionally run human reference texts through the text encoder, and combine via a harmonic mean with the image-caption score.
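The metric itself is simple: CLIPScore(c, v) = w * max(cos(c, v), 0), with w = 2.5 in the paper. A sketch using a Hugging Face CLIP checkpoint (preprocessing details may differ from the official implementation):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipscore(caption: str, image: Image.Image, w: float = 2.5) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs.input_ids,
                                           attention_mask=inputs.attention_mask)
        img_emb = model.get_image_features(pixel_values=inputs.pixel_values)
    cos = torch.nn.functional.cosine_similarity(text_emb, img_emb).item()
    return w * max(cos, 0.0)  # reference-free; higher is better

# score = clipscore("two dogs run across the grass", Image.open("photo.jpg"))
```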

Document-level Entity-based Extraction as Template Generation

paper presentation

Comment: just another method; it would be cheaper to use annotators.

Notes: Fill entities into templates; relations are untyped. Neither the extractive approach (error propagation) nor the generative approach works well (hard to fit into seq2seq, leading to information loss). Uses a BART model.

Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

paper presentation code

Comment: Nicely categorizes different NLG tasks into 3 buckets. Another model-based evaluation.

Notes:
Categorize based on information change from input (X) to output (Y):

  1. Compression (Big X -> small Y): summarization, image captioning, data-to-text, question generation
  2. Transduction (big X -> big Y): translation, paraphrasing, style transfer, language simplification
  3. Creation (small X -> big Y): dialog, advice generation, story generation, poetry generation

To evaluate: information alignment. text a -> alignment model -> per-token alignment scores against arbitrary data b

compression:
consistency(y, x) = mean(align(y -> x))
relevance(y, x, r) = mean(align(r -> y)) * mean(align(y -> x))

transduction:
preservation(y, x) = harmonic mean (F1) of mean(align(x -> y)) and mean(align(y -> x))

creation:
engagingness(y, x, c) = sum(align(y -> [x, c]))
groundedness(y, c) = sum(align(y -> c))

alignment modeling: embedding matching (sketched below), discriminative model, aggregated regression
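A rough sketch of the embedding-matching option for align(a -> b): match each token of a to its most similar token of b and average (BERTScore-style greedy matching; the encoder choice is an arbitrary assumption, and the paper also trains dedicated alignment models):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
enc = AutoModel.from_pretrained("roberta-base").eval()

def embed(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        return enc(**ids).last_hidden_state.squeeze(0)  # (tokens, hidden)

def align(a: str, b: str) -> torch.Tensor:
    """One alignment score per token of a: max cosine similarity against b."""
    ea, eb = embed(a), embed(b)
    ea = ea / ea.norm(dim=-1, keepdim=True)
    eb = eb / eb.norm(dim=-1, keepdim=True)
    return (ea @ eb.T).max(dim=1).values

def consistency(y: str, x: str) -> float:
    return align(y, x).mean().item()  # mean(align(y -> x))
```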

Datasets: CNN/DM and XSum-QAGS; Mir et al.’s Yelp style-transfer dataset; Mehri and Eskenazi’s PersonaChat and TopicalChat knowledge-grounded dialog datasets.

Types of Out-of-Distribution Texts and How to Detect Them

paper presentation

Comment: Academic; not useful. Another model-based metric.

Notes:
Detecting OOD texts.

Two types of OOD text samples:
background shift -> density estimation, p(x), perplexity
semantic shift -> calibration, p(y|x)

detection methods:
a. calibration-based: model’s confidence in the label
b. density estimation
both detect via thresholding a score (sketch below)
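A minimal sketch of both detector families under the shared thresholding scheme (scores are toy numbers; in practice p(y|x) comes from the classifier and p(x) from a density model or LM perplexity):

```python
import numpy as np

def ood_flags(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Flag examples whose in-distribution score falls below the threshold."""
    return scores < threshold

# (a) Calibration-based: max softmax probability of p(y|x) as the ID score.
probs = np.array([[0.97, 0.03], [0.51, 0.49]])
msp = probs.max(axis=1)
print(ood_flags(msp, threshold=0.8))      # [False  True]

# (b) Density-based: average log-likelihood under a language model as the ID score.
loglik = np.array([-1.2, -7.5])
print(ood_flags(loglik, threshold=-4.0))  # [False  True]
```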

Combining Lexical and Dense Retrieval for Computationally Efficient Multi-hop Question Answering

Comment: No code. Using BM25 is going to be super fast compared to dense methods.
Notes:
multi-hop retrieval over an open corpus / multi-hop reasoning
retriever reduces the search space -> reading comprehension model extracts answers

multi-hop question heuristics:
the question is more lexically/semantically distant from at least one relevant passage
high lexical overlap between the question and the first relevant passage

thus, the proposed hybrid approach:
sparse retrieval for the first relevant passage, dense retrieval for the rest of the passages
BM25 -> ranker -> dense retriever (sketch below)

compared to dense retrieval models DPR and MDR; competitive at similar compute, measured by EM@2, 10, 20
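A schematic of the hybrid pipeline (rank_bm25 and sentence-transformers stand in for the paper’s sparse and dense components; the model name is an arbitrary choice):

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = ["passage one ...", "passage two ...", "passage three ..."]
bm25 = BM25Okapi([p.split() for p in corpus])
dense = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
corpus_emb = dense.encode(corpus, convert_to_tensor=True)

def retrieve(question: str, k: int = 2):
    # Hop 1: cheap sparse retrieval; the first relevant passage tends to
    # overlap lexically with the question.
    first = corpus[int(bm25.get_scores(question.split()).argmax())]
    # Hop 2: dense retrieval on the question expanded with the hop-1 passage,
    # since the second passage can be lexically distant from the question alone.
    q_emb = dense.encode(question + " " + first, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=k)[0]
    return first, [corpus[h["corpus_id"]] for h in hits]
```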

Efficient Nearest Neighbor Language Models

paper presentation code

Comment: the paper proposes several methods to speed up the kNN-LM. Achieves a ~6x speedup, but is still dramatically slower than a plain neural LM.

Notes:
most models are parametric
non-parametric models rely on a datastore to help predict

nearest neighbor LM (builds on the original kNN-LM):
p(w|c) = lambda * p_knn(w|c) + (1 - lambda) * p_nlm(w|c)
the second term is a pretrained, frozen LM; c is the context, w the next word

feed the query context into the NLM to get a query embedding, retrieve neighbors weighted by a softmax over negative distances, then linearly interpolate the two terms (sketch below)
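A schematic of the interpolation step (FAISS stands in for the datastore index; the sizes, toy datastore, and lambda are illustrative):

```python
import numpy as np
import faiss

d, V = 512, 32000                                     # embedding dim, vocab size
keys = np.random.randn(10_000, d).astype("float32")   # datastore: context embeddings
vals = np.random.randint(0, V, size=10_000)           # ...and their observed next tokens

index = faiss.IndexFlatL2(d)
index.add(keys)

def knn_lm_prob(query: np.ndarray, p_nlm: np.ndarray, lam: float = 0.25, k: int = 8):
    """p(w|c) = lam * p_knn(w|c) + (1 - lam) * p_nlm(w|c)."""
    dist, idx = index.search(query[None].astype("float32"), k)
    logits = -dist[0]                        # softmax over negative L2 distances
    w = np.exp(logits - logits.max())        # (numerically stabilized)
    w /= w.sum()
    p_knn = np.zeros(V)
    np.add.at(p_knn, vals[idx[0]], w)        # neighbors vote for their next token
    return lam * p_knn + (1 - lam) * p_nlm

# p = knn_lm_prob(np.random.randn(d), np.full(V, 1.0 / V))
```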

expensive, due to datastore size

fixes:

adaptive retrieval: use an MLP to predict lambda and retrieve only for useful tokens; saves 50% of retrieval ops
dimension reduction: PCA
datastore pruning: random, k-means, rank-based, greedy merging; greedy merging achieves a 60% compression rate while losing only 0.2 perplexity

overall a 6.6x speedup over the kNN-LM

Challenges in Detoxifying Language Models

presentation

Comment: No good definition of toxicity. Detoxifying hurts minority data more than the majority.

Notes:
Perspective API (Jigsaw) and RealToxicityPrompts for content moderation.

An utterance is considered toxic if it is rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion. Subjective, context-dependent, ambiguous.

Still lots of false positives; requires human evaluation.
Optimizing for low toxicity scores hurts LM quality, and hurts minority groups more than others.

Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators’ Disagreement

paper presentation

Comment: a dataset with significant sampling bias in the selection of annotators. The trained model is able to detect offensive language as agreed upon by that sample of annotators. Duh.

Notes:
research dataset:
400k tweets on COVID, the US election, and BLM
subsampled to 12k tweets, balanced per domain
5 annotations each
a model trained on tweets where annotators agree is effective at detecting offensive language

‘Just What do You Think You’re Doing, Dave?’ A Checklist for Responsible Data Use in NLP

Comment: trying to set a standard for dataset generation and the approval process. Yay, standard-setting in academia.

Notes:
widely agreed-upon principles for responsible data use:

  • copyright and terms
  • privacy (vs crime prevention)
  • transparency (vs intellectual property)
  • reproducibility (vs right to be forgotten)
  • ‘do no harm’

Trainable Ranking Models to Evaluate the Semantic Accuracy of Data-to-Text Neural Generator

paper

Machine-in-the-Loop Rewriting for Creative Image Captioning

paper