Notes from EMNLP 2021
LMdiff: A Visual Diff Tool to Compare Language Models
Comment: Would be interesting to use the tool to drill into language model memorization.
Notes: Visualization tool that compares the internal states of two language models to show how inference results and output distributions differ.
LightTag: Text Annotation Platform
Comment: Do not see significant advantage over Prodigy, Doccano, or Label Studio. Doesn’t have active learning hooks either.
Notes: Another text annotation tool. Has an entity relationship diagram, infinite scrolling, and batch actions.
Thermostat: A Large Collection of NLP Model Explanations and Analysis Tools
Comment: Interesting tool for pretrained model EDA.
Notes:
Feature attribution methods for explainability:
- gradient x input / gradient saliency
- integrated gradients
- LIME
- Occlusion (sliding window occlusions)
- Shapley permutations
These methods generate the feature attribution maps.
Combines Hugging Face (models/datasets) with Facebook's Captum library (see the sketch below).
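A minimal sketch of one such method, integrated gradients, via Captum's LayerIntegratedGradients on a Hugging Face classifier. The model name and the pad-token baseline are illustrative choices, not something Thermostat prescribes:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import LayerIntegratedGradients

# any HF sequence classifier works; this SST-2 model is just an example
model_name = "textattack/bert-base-uncased-SST-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def forward_fn(input_ids, attention_mask):
    return model(input_ids, attention_mask=attention_mask).logits

enc = tokenizer("the movie was surprisingly good", return_tensors="pt")
input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]
target = int(model(**enc).logits.argmax(dim=-1))

# attribute w.r.t. the embedding layer, since token ids are discrete
lig = LayerIntegratedGradients(forward_fn, model.bert.embeddings)
baseline = torch.full_like(input_ids, tokenizer.pad_token_id)
attrs = lig.attribute(input_ids, baselines=baseline,
                      additional_forward_args=(attention_mask,),
                      target=target)
scores = attrs.sum(dim=-1).squeeze(0)  # one score per token
for tok, s in zip(tokenizer.convert_ids_to_tokens(input_ids[0]), scores):
    print(f"{tok:>12s} {s.item():+.3f}")
```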
Benchmarking Meta-embeddings: What Works and What Does Not
Comment: ??? What’s this…
Set Generation Networks for End-to-End Knowledge Base Population
Comment: another transformer-type model for an end-to-end task, but no code release, so no reproducibility.
Notes:
KB population (KBP): NER -> entity linking -> relation extraction.
The pipeline is prone to error propagation.
Some prior work tried seq2seq models for end-to-end KBP.
This paper frames KBP as set generation: transformer encoder, non-autoregressive decoder, and a bipartite matching loss (sketched below).
Works on the Wiki and Geo datasets.
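A toy sketch of the bipartite matching loss: Hungarian matching between predicted slots and gold triples via scipy, with made-up cost values for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# rows = predicted slots from the non-autoregressive decoder,
# cols = gold triples; entries = negative log-likelihood costs (made up)
cost = np.array([[0.2, 1.5, 0.9],
                 [1.1, 0.3, 1.4],
                 [0.8, 1.2, 0.1],
                 [0.5, 0.9, 0.7]])

row_ind, col_ind = linear_sum_assignment(cost)  # Hungarian matching
loss = cost[row_ind, col_ind].sum()             # loss only over matched pairs
# unmatched prediction slots would be trained toward a "no triple" label
print(list(zip(row_ind, col_ind)), loss)        # [(0, 0), (1, 1), (2, 2)] 0.6
```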
Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing
Comment: No code.
Notes:
Introduces the Distiller framework for knowledge distillation.
Intermediate-layer distillation matters most; it maximizes a lower bound on mutual information.
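A hedged sketch of a combined distillation objective. The paper studies many objective variants (including mutual-information lower bounds); the MSE intermediate-layer term below is a simple stand-in, and the loss weights are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      labels, T=2.0, alpha=0.5, beta=0.1):
    # soft-target KL between temperature-scaled teacher and student
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    # hard-label cross-entropy on the student
    ce = F.cross_entropy(student_logits, labels)
    # intermediate-layer matching (MSE as a stand-in objective)
    inter = F.mse_loss(student_hidden, teacher_hidden)
    return alpha * kd + (1 - alpha) * ce + beta * inter
```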
Adversarial Attacks: small input perturbations which lead to misclassification by the model
Comment: Useful for word-level adversarial attacks or data augmentation.
Notes:
sentence level
word level
character level
Adversarial attacks on NER currently exist only at the character level.
RQ1:
attack strategies:
char level: DeepWordBug-I and DeepWordBug-II
word level: BERT-Attack (BERT), CLARE (RoBERTa); named entities are left unchanged
sentence level: SCPN, which generates paraphrases using predefined syntactic parses
Char level works best, but fails when perturbing named entities is prohibited, and is unlikely to occur in the real world; uses per-word Levenshtein distance limits (see the sketch below).
word level works great
sentence level doesn't work
RQ2:
DeepWordBug-I: adversarial training / data augmentation with 500 samples resulted in improved robustness.
BERT-Attack: needs 1000 samples to work; improved accuracy.
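A minimal sketch of a DeepWordBug-style character-level perturbation with a per-word Levenshtein budget and protected named entities. The helper names and the budget value are mine, not the paper's:

```python
import random

def levenshtein(a: str, b: str) -> int:
    # classic one-row dynamic-programming edit distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def char_swap(word: str) -> str:
    # DeepWordBug-style transform: swap two adjacent characters
    if len(word) < 3:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def perturb(tokens, protected, max_edit=2):
    # skip named entities and enforce a per-word edit budget
    out = []
    for tok in tokens:
        cand = char_swap(tok) if tok not in protected else tok
        out.append(cand if levenshtein(tok, cand) <= max_edit else tok)
    return out

print(perturb(["Barack", "Obama", "visited", "Chicago"],
              protected={"Barack", "Obama", "Chicago"}))
```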
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Comment: Model based evaluation metrics using CLIP.
Notes:
Past: compare the candidate to references with text-text similarity, e.g., BLEU, CIDEr, METEOR, SPICE, BERTScore. But these are unreliable, and references are expensive to collect.
Proposal: use CLIP. Encode the caption with the text encoder and the image with the image encoder, then score by cosine similarity; higher is better.
RefCLIPScore additionally feeds human reference texts through the text encoder.
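A sketch of CLIPScore with Hugging Face's CLIP. The 2.5 · max(cos, 0) rescaling follows the paper; details such as the paper's "A photo depicts" prompt prefix are omitted here:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipscore(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.cosine_similarity(img, txt).item()
    return 2.5 * max(cos, 0.0)  # rescaling from the paper

# usage: clipscore(Image.open("photo.jpg"), "a dog playing fetch")
```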
Document-level Entity-based Extraction as Template Generation
Comment: just one other method. Cheaper to use annotators.
Notes: Fill entities into a template. Non-typed relations. Neither the extractive approach (error propagation) nor the generative approach works (hard to fit into seq-to-seq; leads to information loss). Uses a BART model.
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
Comment: Nicely categorizes different NLG tasks into 3 buckets. Another model-based evaluation.
Notes:
Categorize based on information change from input (X) to output (Y):
- Compression (Big X -> small Y): summarization, image captioning, data-to-text, question generation
- Transduction (big X -> big Y): translation, paraphrasing, style transfer, language simplification
- Creation (small X -> big Y): dialog, advice generation, story generation, poetry generation
To evaluate: information alignment. Text a -> model -> arbitrary data b.
compression:
consistency(y, x) = mean(align(y->x))
relevance(y, x, r) = mean(align(r -> y)) x mean(align(y -> x))
transduction:
preservation(y, x) = F1 of mean(align(x -> y)) and mean(align(y -> x))
creation:
engagingness(y, x, c) = sum(align(y -> [x, c]))
groundedness(y, c) = sum(align(y -> c))
alignment modeling: embedding matching, discriminative model, aggregated regression
CNN/DM and XSUM-QAGS datasets; Mir et al. Yelp style transfer dataset; Mehri and Eskenazi PersonaChat and TopicalChat knowledge-grounded dialog datasets.
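A toy sketch of the compression metrics above, assuming per-token alignment scores in [0, 1] have already been produced by one of the alignment models:

```python
import numpy as np

def consistency(align_y_to_x):
    # mean alignment of output tokens y against the input x
    return float(np.mean(align_y_to_x))

def relevance(align_r_to_y, align_y_to_x):
    # reference tokens covered by the output, times output grounded in input
    return float(np.mean(align_r_to_y) * np.mean(align_y_to_x))

# made-up per-token scores (in practice from embedding matching,
# a discriminative model, or aggregated regression)
print(consistency([0.9, 0.8, 0.95]))            # 0.883...
print(relevance([0.7, 0.6], [0.9, 0.8, 0.95]))  # 0.574...
```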
Types of Out-of-Distribution Texts and How to Detect Them
Comment: academic. Not useful. Another model-based metric.
Notes:
Detecting OOD texts.
Two types of OOD text samples:
background shift -> density estimation, p(x), perplexity
semantic shift -> calibration, p(y|x)
detection methods:
a. calibration-based: model's confidence in the label
b. density estimation
detect via thresholding
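A minimal sketch of calibration-based detection via maximum-softmax-probability thresholding; the threshold value here is arbitrary:

```python
import numpy as np

def msp_scores(logits):
    # maximum softmax probability per example
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return p.max(axis=-1)

def flag_ood(logits, threshold=0.7):
    # low top-class confidence -> flagged as OOD
    return msp_scores(logits) < threshold

logits = np.array([[4.0, 0.1, 0.2],    # confident -> in-distribution
                   [0.9, 1.0, 1.1]])   # flat -> flagged OOD
print(flag_ood(logits))                # [False  True]
```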
Combining Lexical and Dense Retrieval for Computationally Efficient Multi-hop Question Answering
Comment: No code. Using BM25 is going to be super fast compared to dense methods.
Notes:
multi-hop retrieval on open corpus/reasoning
retriever reduces the search space -> reading comprehension model extracts answers
Multi-hop question heuristics:
the question is more lexically/semantically distant from at least one relevant passage
high lexical overlap between the question and the first relevant passage
Thus, the proposed hybrid approach:
sparse retrieval for the first relevant passage, dense retrieval for the remaining passages
bm25 -> ranker -> dense retriever.
Compared to the dense retrieval models DPR and MDR; competitive at similar compute, measured by EM@2/10/20.
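A rough sketch of the hybrid idea, with rank_bm25 for hop 1 and a sentence-transformers encoder standing in for the paper's dense retriever; the corpus, question, and model choice are all illustrative:

```python
import numpy as np
from rank_bm25 import BM25Okapi                    # pip install rank-bm25
from sentence_transformers import SentenceTransformer

passages = [
    "Titanic won the Academy Award for Best Picture in 1998.",
    "James Cameron directed Titanic, released in 1997.",
    "The Shape of Water won Best Picture in 2018.",
]
question = "Who directed the film that won Best Picture in 1998?"

# hop 1: sparse retrieval -- the question shares vocabulary with the
# first relevant passage, so BM25 is cheap and effective here
bm25 = BM25Okapi([p.lower().split() for p in passages])
first = passages[int(np.argmax(bm25.get_scores(question.lower().split())))]

# hop 2: dense retrieval conditioned on question + hop-1 passage, since
# later hops are lexically distant from the question alone
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in dense encoder
scores = encoder.encode(passages) @ encoder.encode(question + " " + first)
scores[passages.index(first)] = -np.inf            # don't re-retrieve hop 1
print(first, "->", passages[int(np.argmax(scores))])
```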
Efficient Nearest Neighbor Language Models
Comment: the paper proposes several methods to optimize kNN-LM speed. Achieved a 6x speedup, but still dramatically slower than a neural LM.
Notes:
Most LMs are parametric.
Non-parametric models rely on a datastore to help predict: the nearest neighbor LM (kNN-LM).
p(w|c) = λ · p_kNN(w|c) + (1 − λ) · p_NLM(w|c)
The second term is a pretrained, frozen neural LM; c is the context, w the next word.
Send the query context through the NLM to get a query embedding, retrieve neighbors, weight them by a softmax over negative distances, and linearly interpolate the two terms (sketch below).
This work builds on kNN-LM, which is expensive due to datastore size.
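A numpy sketch of the interpolation above; λ, k, and the random datastore are illustrative:

```python
import numpy as np

def knn_lm_prob(query, keys, values, p_nlm, vocab_size, lam=0.25, k=8):
    # keys: datastore context embeddings; values: their next-token ids
    d = np.linalg.norm(keys - query, axis=-1)   # distance to all entries
    nn = np.argsort(d)[:k]                      # k nearest neighbors
    w = np.exp(-d[nn]); w /= w.sum()            # softmax over -distance
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn], w)             # aggregate weight per token
    return lam * p_knn + (1 - lam) * p_nlm      # interpolate with the NLM

# toy usage with a random datastore
rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 16))
values = rng.integers(0, 50, size=100)
p = knn_lm_prob(rng.normal(size=16), keys, values,
                p_nlm=np.full(50, 1 / 50), vocab_size=50)
assert abs(p.sum() - 1.0) < 1e-6
```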
fix:
adaptive retrieval - use an MLP to predict λ and retrieve only for useful tokens; saves 50% of retrieval ops
dimension reduction - pca
datastore pruning - random, k-means, rank-based, greedy merging; greedy merging achieves a 60% compression rate while losing only 0.2 perplexity
Overall 6.6x speedup over kNN-LM.
Challenges in Detoxifying Language Models
Comment: No good definition of toxicity. Detoxifying hurts minority data more than the majority.
Notes:
Perspective API (Jigsaw) and RealToxicityPrompts for content moderation.
An utterance is considered toxic if it is rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion. Subjective, context-dependent, ambiguity.
Still lots of false positives; requires human evaluation.
Optimizing for a low toxicity score hurts LM strength, and hurts minority groups more than others.
Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators’ Disagreement
Comment: a dataset with significant sampling bias in the selection of annotators. The trained model is able to detect offensive language as agreed upon by that sample of annotators. Duh.
Notes:
research dataset:
400k tweets on COVID, the US election, and BLM
subsampled to 12k tweets, balanced per domain
5 annotations for each
A model trained on examples where annotators agree is effective at detecting offensive language.
'Just What do You Think You're Doing, Dave?' A Checklist for Responsible Data Use in NLP
Comment: trying to set a standard for the dataset generation and approval process. Yay, standard setting in academia.
Notes:
Widely agreed-upon principles for responsible data use:
- copyright and terms
- privacy (vs crime prevention)
- transparency (vs intellectual property)
- reproducibility (vs right to be forgotten)
- ‘do no harm’