1. Course Introduction
In March 2012, Stanford University launched an online Natural Language Processing course on Coursera, taught by two giants of the NLP field, Dan Jurafsky and Chris Manning:
https://class.coursera.org/nlp/
What follows are my study notes for this course, based primarily on the course slides (PPT/PDF) and supplemented by other reference materials, with personal extensions and annotations mixed in (just a brick thrown out to attract jade). Everyone is welcome to discuss and study together at 我爱公开课 (52opencourse).
Slides download: the compiled collection of Stanford NLP open course courseware.
2. Language Models (Language Model)
1) Introduction to N-grams
In practical applications, we often need to solve problems of this kind: how do we compute the probability of a sentence? For example:
Machine translation: P(high winds tonite) > P(large winds tonite)
Spelling correction: P(about fifteen minutes from) > P(about fifteen minuets from)
Speech recognition: P(I saw a van) >> P(eyes awe of an)
Pinyin-to-character conversion: P(你现在干什么|nixianzaiganshenme) > P(你西安在干什么|nixianzaiganshenme) ("what are you doing now" vs. the implausible "you, in Xi'an, doing what")
Automatic summarization, question answering systems, and so on.
These problems can be formalized as follows:
p(S) = p(w1,w2,w3,w4,w5,...,wn)
     = p(w1)p(w2|w1)p(w3|w1,w2)...p(wn|w1,w2,...,wn-1)  // chain rule
p(S) is called the language model, i.e., a model for computing the probability of a sentence.
So how do we compute p(wi|w1,w2,...,wi-1)? The simplest, most direct way is to count and divide:
p(wi|w1,w2,...,wi-1) = p(w1,w2,...,wi-1,wi) / p(w1,w2,...,wi-1)
However, this faces two serious problems: severe data sparsity, and a parameter space far too large to be practical.
Both are addressed by the Markov assumption: the next word depends only on the one or few words immediately before it.
If we assume the next word depends only on the single preceding word, then:
p(S) = p(w1)p(w2|w1)p(w3|w1,w2)...p(wn|w1,w2,...,wn-1)
     = p(w1)p(w2|w1)p(w3|w2)...p(wn|wn-1)  // bigram
If we assume the next word depends on the two preceding words, then:
p(S) = p(w1)p(w2|w1)p(w3|w1,w2)...p(wn|w1,w2,...,wn-1)
     = p(w1)p(w2|w1)p(w3|w1,w2)...p(wn|wn-2,wn-1)  // trigram
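For example, under the bigram assumption, the sentence "I saw a van" from the speech-recognition example above factors as:
p(I saw a van) = p(I) p(saw|I) p(a|saw) p(van|a)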
So when facing a real problem, how do we choose the number of words to condition on, i.e., n?
A larger n gives more constraining information about the next word, hence greater discriminative power;
a smaller n means each n-gram occurs more often in the training corpus, hence more reliable statistics and greater robustness.
In theory, the larger n the better; in practice, trigrams are used the most; even so, as a rule of thumb: if bigrams solve the problem, never use trigrams.
2) Building a Language Model
Usually the language model is built by Maximum Likelihood Estimation (MLE), which gives the best fit to the training data. For bigrams the formula is:
p(wi|wi-1) = count(wi-1, wi) / count(wi-1)
For example, given the sentence set
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
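As a quick sketch (my own illustration, not the course's code), the MLE bigram estimates for this toy corpus can be computed directly by counting:

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_mle(prev, word):
    # p(word|prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_mle("<s>", "I"))     # 2/3: "I" opens 2 of the 3 sentences
print(bigram_mle("I", "am"))      # 2/3
print(bigram_mle("Sam", "</s>"))  # 1/2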
Part of the resulting bigram language model is shown below (the count tables in the original post were images from the course's Berkeley Restaurant Project example and are not reproduced here):
the unigram counts c(wi): [table not reproduced]
the bigram counts c(wi-1, wi): [table not reproduced]
and the resulting bigram probabilities: [table not reproduced]
The probability of the sentence "<s> I want english food </s>" is then:
p(<s> I want english food </s>) = p(I|<s>) × p(want|I) × p(english|want) × p(food|english) × p(</s>|food) = .000031
To avoid numerical underflow and to improve performance, probabilities are usually converted to logs, so that additions replace multiplications:
log(p1*p2*p3*p4) = log(p1) + log(p2) + log(p3) + log(p4)
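Concretely (reusing the toy bigram_mle sketch above), a sentence log-probability is just a running sum:

import math

# log p(<s> I am Sam </s>) under the toy bigram model:
# p(I|<s>)=2/3, p(am|I)=2/3, p(Sam|am)=1/2, p(</s>|Sam)=1/2
probs = [2/3, 2/3, 1/2, 1/2]
log_p = sum(math.log(p) for p in probs)
print(log_p)            # -2.197..., summed instead of multiplying tiny numbers
print(math.exp(log_p))  # 0.111... = 1/9, recovering the raw probability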
Recommended open-source language-model toolkits:
SRILM: http://www.speech.sri.com/projects/srilm/
IRSTLM: http://hlt.fbk.eu/en/irstlm
MITLM: http://code.google.com/p/mitlm/
BerkeleyLM: http://code.google.com/p/berkeleylm/
Recommended open n-gram datasets:
Google Web1T 5-gram: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Total number of tokens: 1,306,807,412,486
Total number of sentences: 150,727,365,731
Total number of unigrams: 95,998,281
Total number of bigrams: 646,439,858
Total number of trigrams: 1,312,972,925
Total number of fourgrams: 1,396,154,236
Total number of fivegrams: 1,149,361,413
Total number of n-grams: 4,600,926,713
Google Books N-grams: http://books.google.com/ngrams/
Chinese Web 5-gram: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06
3) Evaluating a Language Model
Once the language model is built, how do we judge whether it is any good? There are currently two main evaluation approaches:
Practical approach: evaluate the model by its performance in a real application (e.g., spelling correction, machine translation). The advantage is that this is intuitive and practical; the disadvantage is that it is unfocused and not very objective.
Theoretical approach: perplexity. Its basic idea is that the language model that assigns the test set a higher probability is the better one. For a test set W = w1 w2 ... wN, the formula is:
PP(W) = p(w1,w2,...,wN)^(-1/N)
As the formula shows, the smaller the perplexity, the larger the sentence probability and the better the language model. For n-gram models built from 38 million words of Wall Street Journal training data and tested on 1.5 million words, the perplexities reported in the course slides are: unigram 962, bigram 170, trigram 109.
4) Data Sparsity and Smoothing Techniques
Statistical methods built on large-scale data inevitably collide with the finite size of the training corpus, producing data sparsity, i.e., zero-probability problems, in line with the classic Zipf's law. For example, at IBM (Brown et al.), a trigram model trained on 366M words of English still left 14.7% of test trigrams and 2.2% of test bigrams unseen in the training data.
The data sparseness problem is defined as follows: "The problem of data sparseness, also known as the zero-frequency problem arises when analyses contain configurations that never occurred in the training corpus. Then it is not possible to estimate probabilities from observed frequencies, and some other estimation scheme that can generalize (that configurations) from the training data has to be used." (Dagan)
To make the theoretical model practical, many attempts and efforts have been made, giving rise to a series of classic smoothing techniques. Their shared idea is to "lower the conditional probability mass of observed n-grams so that unseen n-grams receive non-zero conditional probability", while guaranteeing that the probabilities still sum to 1 after smoothing. In detail:
Add-one (Laplace) Smoothing
Add-one smoothing, also known as Laplace's law, guarantees that every n-gram effectively occurs at least once in the training corpus. Taking bigrams as an example, the formula is:
p_add1(wi|wi-1) = (c(wi-1,wi) + 1) / (c(wi-1) + V)
where V is the vocabulary size, i.e., the number of distinct word types.
Carrying on the example from the previous subsection, the counts c(wi-1, wi) after Add-one smoothing become: [table not reproduced]
and the smoothed bigram probabilities become: [table not reproduced]
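A sketch of the same formula on the toy corpus (again assuming the unigrams/bigrams counters defined earlier):

V = len(unigrams)  # vocabulary size of the toy corpus: 12 types, incl. <s> and </s>

def add_one(prev, word):
    # p_add1(word|prev) = (c(prev, word) + 1) / (c(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(add_one("<s>", "I"))      # seen:   (2+1)/(3+12) = 0.2
print(add_one("green", "ham"))  # unseen: (0+1)/(1+12) ≈ 0.077, no longer zero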
When V >> c(wi-1), i.e., when the vast majority of n-grams are unseen in the training corpus (which is usually the case), Add-one smoothing lets the unseen events crowd out the observed ones and performs poorly. The method can be extended to mitigate this, e.g., Lidstone's law and the Jeffreys-Perks law.
Good-Turing Smoothing
Its basic idea is to use counts-of-counts information to smooth the raw counts: an n-gram observed c times is assigned the adjusted count
c* = (c + 1) * n(c+1) / n(c)
where n(c) is the number of distinct n-grams that occur exactly c times.
However, when n(c+1) = 0, or when the n(c) sequence does not decrease smoothly from c to c+1, the adjusted counts become unreliable and model quality suffers, as the figure in the original post illustrated: [figure not reproduced]
A direct remedy is: "do not smooth n-grams whose count exceeds some threshold, typically 8 to 10"; for other refinements see "Simple Good-Turing".
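A rough sketch of the count adjustment with such a cutoff (illustrative only; production implementations like Simple Good-Turing first fit a smooth curve to the n(c) values):

def good_turing(c, n, threshold=8):
    # c* = (c + 1) * n(c+1) / n(c); counts above the threshold stay unsmoothed
    if c > threshold or n.get(c, 0) == 0 or n.get(c + 1, 0) == 0:
        return float(c)
    return (c + 1) * n[c + 1] / n[c]

n = {1: 13, 2: 2}  # counts-of-counts of the toy bigram table: 13 singletons, 2 doubletons
print(good_turing(1, n))  # (1+1) * 2/13 ≈ 0.31: mass is taken away from singletons
print(good_turing(2, n))  # n(3) = 0, so the count 2 is left as-is here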
Interpolation Smoothing
Whether Add-one or Good-Turing, these smoothing techniques treat all unseen n-grams alike, which is inevitably unreasonable (different unseen events may well differ in probability). So here we introduce one more technique, linear interpolation smoothing. Its basic idea is to combine the higher-order model with the lower-order models linearly, using low-order n-gram models to interpolate the high-order n-gram model, since when there is not enough data to estimate the high-order model's probabilities, the low-order models can usually still provide useful information. The formula is:
p_interp(wn|wn-2,wn-1) = λ1 p(wn|wn-2,wn-1) + λ2 p(wn|wn-1) + λ3 p(wn), where λ1 + λ2 + λ3 = 1
The extended, context-dependent version conditions the weights on the history:
p_interp(wn|wn-2,wn-1) = λ1(wn-2,wn-1) p(wn|wn-2,wn-1) + λ2(wn-2,wn-1) p(wn|wn-1) + λ3(wn-2,wn-1) p(wn)
The λs can be estimated with the EM algorithm; the concrete steps are:
First, fix three data sets: training data, held-out data, and test data;
Then, build an initial language model from the training data and choose initial λs (e.g., uniform);
Finally, iteratively optimize the λs with EM so that the probability of the held-out data is maximized (the objective shown in the original slides is not reproduced here).
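A minimal sketch of fixed-weight interpolation for a trigram model (the p_tri/p_bi/p_uni helpers are hypothetical, and the λ values are placeholders one would tune on held-out data as above):

def interp_trigram(w1, w2, w3, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    # p_interp(w3|w1,w2) = λ1*p(w3|w1,w2) + λ2*p(w3|w2) + λ3*p(w3)
    l1, l2, l3 = lambdas  # weights must sum to 1 to keep a valid distribution
    return l1 * p_tri(w1, w2, w3) + l2 * p_bi(w2, w3) + l3 * p_uni(w3)

Because every word keeps a non-zero unigram probability, the interpolated estimate stays non-zero even when the trigram and bigram are unseen.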
Kneser-Ney Smoothing
Web-scale LMs
Take the Google N-gram corpus as an example: the compressed files are 27.9 GB, around 1 TB once unpacked. Facing a corpus this large, pruning is generally needed before use to cut it down to size: e.g., keep only n-grams whose frequency exceeds a threshold; filter out higher-order n-grams (e.g., keep only the n <= 3 resources); prune based on entropy; and so on.
Storage also needs some optimization: e.g., store the n-grams in a trie; use a Bloom filter to assist queries; map strings to ints (via Huffman coding, varints, and similar methods); and convert float/double to int (e.g., keep probabilities to 6 decimal places, then multiply by 10^6, turning the float into an integer), as sketched below.
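A sketch of that last float-to-int trick (my own illustration; 6 decimal places as in the text):

def quantize(p, decimals=6):
    # keep `decimals` places of the probability, then store it as an integer
    return int(round(p * 10**decimals))

def dequantize(q, decimals=6):
    return q / 10**decimals

print(quantize(0.000031))  # 31, stored compactly as an int
print(dequantize(31))      # 3.1e-05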
In 2007, Brants et al. of Google proposed a smoothing scheme aimed at web-scale n-grams called "Stupid Backoff". In the paper's formulation:
S(wi|wi-k+1,...,wi-1) = f(wi-k+1,...,wi) / f(wi-k+1,...,wi-1) if f(wi-k+1,...,wi) > 0; otherwise α · S(wi|wi-k+2,...,wi-1), with α = 0.4 and the recursion bottoming out at the unigram relative frequency S(wi) = f(wi)/N.
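A sketch over raw counts (the dict-based count lookup is my assumption; α = 0.4 follows the paper):

ALPHA = 0.4

def stupid_backoff(context, word, counts, total_tokens):
    # context: tuple of preceding words; counts: dict from word tuples to frequencies
    if not context:
        return counts.get((word,), 0) / total_tokens  # unigram relative frequency
    num = counts.get(context + (word,), 0)
    den = counts.get(context, 0)
    if num > 0 and den > 0:
        return num / den
    # otherwise back off to a shorter context with a fixed discount
    return ALPHA * stupid_backoff(context[1:], word, counts, total_tokens)

Note that the scores S are relative frequencies with backoff, not normalized probabilities, which is why Brants et al. write S rather than P; giving up normalization is what keeps the scheme cheap at web scale.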
Data smoothing is an important means of building robust language models, and its benefit depends on the scale of the training corpus: the smaller the corpus, the more pronounced the effect of smoothing; the larger the corpus, the less pronounced the effect, to the point where it can be ignored; icing on the cake, as it were.
5) Language Model Variants
Class-based N-gram Model
This approach builds the language model over word classes, which alleviates data sparsity and conveniently folds in some syntactic information.
Topic-based N-gram Model
This approach splits the training set by topic into several subsets and builds a separate N-gram language model for each, addressing topic adaptation of the language model. The architecture is as follows: [figure not reproduced]
Cache-based N-gram Model
This approach uses a cache to keep information from the recent history and uses it when computing probabilities at the current moment, addressing dynamic adaptation of the language model. The intuition:
- People tend to use as few words as possible in an article.
- If a word has been used, it will quite possibly be used again later.
The architecture is as follows: [figure not reproduced]
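One common formulation (a sketch; the interpolation weight and cache size are my assumptions) mixes a static n-gram model with a recency cache:

from collections import deque

class CacheLM:
    def __init__(self, base_prob, cache_size=1000, lam=0.9):
        self.base_prob = base_prob             # static model, e.g. bigram_mle above
        self.cache = deque(maxlen=cache_size)  # recently observed words
        self.lam = lam
    def prob(self, prev, word):
        # p(w|h) = λ * p_ngram(w|h) + (1 - λ) * p_cache(w)
        p_cache = self.cache.count(word) / len(self.cache) if self.cache else 0.0
        return self.lam * self.base_prob(prev, word) + (1 - self.lam) * p_cache
    def observe(self, word):
        self.cache.append(word)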
My guess is that this is the strategy adopted by today's intelligent pinyin input methods from QQ, Sogou, Google, and others: build a cache-based language model from each user's personalized input log, and use it to fine-tune the output of the general-purpose language model, making the input method personalized and intelligent. Thanks to the dynamically adaptive model, the product gets smarter, more usable, and more addictive the more it is used.
Skipping N-gram Model & Trigger-based N-gram Model
The core idea of both is to capture long-distance constraints between words.
Exponential language models: the Maximum Entropy model (MaxEnt), the Maximum-Entropy Markov Model (MEMM), and the Conditional Random Field (CRF)
Traditional n-gram language models consider only surface word-form features, with no knowledge at the part-of-speech or semantic level, and they suffer from severe data sparsity; the classic smoothing techniques, too, attack the problem purely from the statistical side, ignoring the linguistic roles of syntax and semantics.
MaxEnt, MEMM, and CRF can incorporate multiple knowledge sources more readily and model the characteristics of language sequences, so they are better suited to sequence labeling problems.