2024-02-21 10:51:33



$ head text8
anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic institutions anarchists advocate social relations based upon voluntary association of autonomous individuals mutual aid and self governance while anarchism is most easily defined by what it is against anarchists also offer positive visions of what they believe to be a truly free society however ideas about how an anarchist


$ fasttext skipgram -input text8  -output text8_ft
Read 17M words
Number of words:  71290
Number of labels: 0
Progress:  99.9% words/sec/thread:    7369 lr:  0.000071 avg.loss:  1.769715 ETA:   0h 0


$ head -n 3 text8_ft.vec 
71290 100
the -0.12909 0.055616 -0.3696 -0.18719 -0.12103 0.2775 0.01854 0.20936 0.09752 -0.19813 -0.21285 0.083418 0.016162 0.14121 -0.24223 -0.0094061 0.028332 0.37123 -0.11781 -0.19176 0.37214 -0.039365 -0.097866 -0.0050016 -0.28954 -0.45515 0.014649 0.23746 0.047788 0.20878 0.03758 0.058175 -0.14702 -0.059107 -0.056708 0.18481 -0.24313 -0.077047 -0.052451 -0.14229 0.44761 -0.078969 -0.34184 0.23616 0.13514 0.10314 -0.0066068 0.43448 0.65234 0.47869 0.4127 0.41117 0.03521 -0.10341 -0.19444 0.28458 0.011618 0.086314 -0.31977 -0.14575 0.11906 0.009967 0.21598 0.1113 0.020127 -0.10383 0.066471 -0.22908 -0.1199 0.25135 0.19716 -0.072622 0.23 -0.025951 0.042014 0.021876 0.039729 -0.55375 0.095557 0.46388 -0.37981 0.00016889 0.2524 -0.32383 0.32751 -0.44859 -0.10094 -0.12716 -0.40568 -0.060579 -0.12642 0.17714 -0.079242 -0.13409 0.058547 0.13197 -0.015039 0.025645 -0.066081 0.12708
of -0.16715 -0.0010567 -0.19993 -0.093281 -0.085323 0.16803 0.02827 0.21193 -0.18461 0.030626 0.061337 0.20897 -0.048829 0.072872 -0.24804 0.14995 -0.10427 0.25362 -0.25377 -0.046909 0.23103 -0.13958 0.096698 -0.13873 -0.18816 -0.33925 -0.12769 0.025515 0.1927 -0.23886 -0.096003 0.12565 -0.38746 -0.17257 -0.016184 0.11388 -0.11505 -0.135 0.18531 -0.31078 0.25641 -0.21784 -0.23305 0.48851 0.29054 -0.093619 -0.088168 0.40308 0.49952 0.4213 0.18668 0.26579 -0.10406 -0.0013798 -0.15389 0.31486 0.036097 0.032645 -0.11297 0.26994 -0.031791 0.034534 0.0045391 -0.082605 0.16027 -0.1163 0.045438 -0.18456 -0.033046 0.14392 0.38028 0.00054076 0.17435 0.008556 0.19375 -0.020889 0.17603 -0.48627 0.0014847 0.23283 -0.18314 -0.071 -0.028154 -0.34701 0.20839 -0.21952 -0.1269 -0.01303 -0.34134 -0.018452 -0.088293 0.1442 -0.010917 -0.18804 0.029666 0.12227 -0.059641 -0.099701 0.080151 0.098683
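The .vec file is plain text: a header line with the vocabulary size and the dimension, then one word per line followed by its vector components. A minimal Python sketch for reading that layout follows; the function name load_vec and the tiny 3-dimensional sample are made up for illustration and are not part of fastText.

```python
import io

def load_vec(stream):
    """Parse the text .vec layout: 'vocab_size dim' header, then 'word v1 ... v_dim' rows."""
    n_words, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} components, got {len(parts) - 1}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return n_words, dim, vectors

# Hypothetical 3-dimensional stand-in for a real .vec file.
sample = io.StringIO("2 3\nthe 0.1 -0.2 0.3\nof 0.0 0.5 -1.0\n")
n_words, dim, vectors = load_vec(sample)
print(n_words, dim, vectors["the"])
```

For the real file, the same function should work with open("text8_ft.vec", encoding="utf-8") in place of the StringIO sample.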


2. For OOV (out-of-vocabulary) words, word vectors can still be obtained by running inference with the .bin model trained above:

$ cat testQuery.txt 
what the fuck
out of bug


$ fasttext print-word-vectors text8_ft.bin < testQuery.txt >results
$ head results 
what -0.057545 -0.48528 -0.20754 -0.15859 -0.14724 0.039533 0.23823 -0.010322 -0.11841 0.2602 -0.071378 0.045908 -0.1794 0.13509 0.42207 -0.073658 -0.085075 -0.010533 -0.30685 -0.23157 -0.0038759 -0.22726 0.11984 0.097364 -0.32854 -0.12644 0.10312 0.05729 0.0088756 -0.12448 0.12922 0.16195 0.22631 -0.14809 0.015782 0.88848 -0.22506 0.31695 -0.017969 0.067788 0.022775 -0.30599 0.10087 0.57101 0.32064 0.16622 -0.17665 -0.064036 0.79752 0.46684 0.43368 0.36142 0.076338 0.21368 0.051775 -0.24059 0.34093 0.19272 -0.43182 -0.10237 -0.07673 0.081198 0.030859 -0.30472 -0.072027 -0.049737 0.025858 0.20029 0.23727 0.21938 0.40949 -0.066096 0.21677 -0.35277 0.12356 -0.26148 0.34904 -0.2038 -0.20233 -0.11801 -0.24752 0.33782 0.0098645 -0.38913 -0.19182 0.11744 -0.065232 -0.13656 -0.4755 0.10589 -0.20734 0.033725 -0.092295 0.083127 -0.26734 0.29432 0.2051 -0.1562 -0.041519 0.1008 
the -0.12909 0.055616 -0.3696 -0.18719 -0.12103 0.2775 0.01854 0.20936 0.09752 -0.19813 -0.21285 0.083418 0.016162 0.14121 -0.24223 -0.0094061 0.028332 0.37123 -0.11781 -0.19176 0.37214 -0.039365 -0.097866 -0.0050016 -0.28954 -0.45515 0.014649 0.23746 0.047788 0.20878 0.03758 0.058175 -0.14702 -0.059107 -0.056708 0.18481 -0.24313 -0.077047 -0.052451 -0.14229 0.44761 -0.078969 -0.34184 0.23616 0.13514 0.10314 -0.0066068 0.43448 0.65234 0.47869 0.4127 0.41117 0.03521 -0.10341 -0.19444 0.28458 0.011618 0.086314 -0.31977 -0.14575 0.11906 0.009967 0.21598 0.1113 0.020127 -0.10383 0.066471 -0.22908 -0.1199 0.25135 0.19716 -0.072622 0.23 -0.025951 0.042014 0.021876 0.039729 -0.55375 0.095557 0.46388 -0.37981 0.00016889 0.2524 -0.32383 0.32751 -0.44859 -0.10094 -0.12716 -0.40568 -0.060579 -0.12642 0.17714 -0.079242 -0.13409 0.058547 0.13197 -0.015039 0.025645 -0.066081 0.12708 
fuck -0.058611 0.057183 -0.041783 -0.37217 0.14209 0.34844 -0.63363 -0.36179 -0.072163 0.91156 0.03035 0.11818 0.67802 0.081026 0.64936 -0.12426 0.22982 -0.23246 0.040846 0.041818 0.27794 -0.0099458 -0.19554 0.54899 -0.44809 -0.31202 -0.22453 0.10881 -0.036528 -0.12731 0.40714 0.065295 0.57494 0.034111 0.3151 -0.031521 0.71399 -0.014006 -0.12132 0.23345 0.70018 -0.050306 0.36475 0.52981 0.25617 -0.3498 -0.25729 -0.19234 0.39339 0.050153 0.59596 -0.41099 -0.16302 -0.37753 -0.31371 -0.1496 0.19898 -0.33186 -1.0232 0.22755 0.71151 -0.025874 -0.10878 -0.76363 -0.80891 -0.10293 0.61912 0.5186 0.30178 0.032113 0.50403 0.14278 0.35163 -0.37008 -0.40752 -0.62272 0.50291 -0.096062 -0.23859 0.21181 0.49698 0.71006 0.25118 -0.61219 -0.16518 -0.083687 0.2768 -0.13805 -0.71201 0.40129 -0.080268 -0.15334 0.21017 0.075741 -0.5743 -0.15687 0.84504 -0.74026 0.51993 0.20547 
out 0.095084 -0.34668 -0.29661 0.36503 -0.049586 0.52637 0.21526 0.0082911 -0.33428 0.26074 -0.11496 0.40547 -0.0020223 0.29337 0.039203 0.10698 -0.37423 0.22085 -0.037315 0.092291 0.21265 -0.11413 -0.1042 0.047826 0.083402 -0.1864 0.1972 -0.35872 0.071064 -0.32934 -0.14132 0.26032 -0.00452 0.039306 0.21692 0.28521 0.11242 0.32081 0.0083984 -0.32079 0.25809 -0.52832 -0.032795 0.31803 0.361 0.081924 -0.32014 0.039908 0.6 0.47681 0.13996 0.11896 0.059675 -0.33345 -0.10751 0.089404 0.37752 -0.07873 -0.16767 0.1458 -0.10502 -0.18125 0.24368 0.1482 -0.41592 0.13236 0.22565 -0.0059395 0.1614 0.046295 0.45359 -0.12962 0.33642 -0.21669 -0.27091 -0.16509 0.18419 -0.27586 0.12269 -0.012149 -0.23497 0.20923 0.43814 -0.32106 -0.17071 -0.0025727 -0.025948 -0.071002 -0.2163 0.12129 0.17356 -0.159 -0.26937 0.21498 0.11852 -0.014236 0.28358 -0.30305 0.20611 -0.20913 
of -0.16715 -0.0010567 -0.19993 -0.093281 -0.085323 0.16803 0.02827 0.21193 -0.18461 0.030626 0.061337 0.20897 -0.048829 0.072872 -0.24804 0.14995 -0.10427 0.25362 -0.25377 -0.046909 0.23103 -0.13958 0.096698 -0.13873 -0.18816 -0.33925 -0.12769 0.025515 0.1927 -0.23886 -0.096003 0.12565 -0.38746 -0.17257 -0.016184 0.11388 -0.11505 -0.135 0.18531 -0.31078 0.25641 -0.21784 -0.23305 0.48851 0.29054 -0.093619 -0.088168 0.40308 0.49952 0.4213 0.18668 0.26579 -0.10406 -0.0013798 -0.15389 0.31486 0.036097 0.032645 -0.11297 0.26994 -0.031791 0.034534 0.0045391 -0.082605 0.16027 -0.1163 0.045438 -0.18456 -0.033046 0.14392 0.38028 0.00054076 0.17435 0.008556 0.19375 -0.020889 0.17603 -0.48627 0.0014847 0.23283 -0.18314 -0.071 -0.028154 -0.34701 0.20839 -0.21952 -0.1269 -0.01303 -0.34134 -0.018452 -0.088293 0.1442 -0.010917 -0.18804 0.029666 0.12227 -0.059641 -0.099701 0.080151 0.098683 
bug -0.16104 0.22345 -0.52171 -0.049254 0.36398 -0.03377 -0.51757 -0.13128 -0.033654 0.44559 -0.73595 -0.17421 -0.061673 0.15399 -0.17079 -0.35185 0.2719 0.58866 -0.18934 -0.38255 -0.55436 5.6939e-05 0.47935 0.79757 -0.21634 -0.27231 -0.7705 -0.27486 -0.080184 -0.13623 0.25086 0.55783 0.23359 0.079897 0.24158 0.45196 -0.034684 0.070867 -0.47792 -0.44604 -0.17802 -0.40082 0.16075 0.36177 0.85764 -0.13079 -0.21857 -0.24954 -0.1655 0.20273 0.028715 0.54311 -0.16729 0.041986 -0.14236 -0.022988 0.77909 0.038478 -0.59859 -0.084233 0.39918 -0.36386 -0.12653 -0.41765 -0.28527 0.25547 0.1974 0.17408 0.28804 0.79494 -0.016819 -0.025348 0.3845 -0.35161 -0.3202 0.48525 0.01959 0.32804 -0.31761 0.44232 0.13141 0.17387 0.0097161 0.052898 0.24716 0.050469 0.073792 0.026017 -0.72611 0.41077 0.25149 0.16558 -0.12419 -0.86742 0.26589 -0.42548 0.26709 0.061441 0.24726 0.25026
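Why the .bin model can return a vector for a word it never saw: fastText represents a word through its character n-grams, so an unseen word can be assembled from n-grams learned during training. The sketch below is a simplified illustration, not fastText's implementation. It assumes n-gram lengths 3 to 6 (the values passed via -minn/-maxn in the enwik9 command of this post), and it looks n-grams up in a plain dictionary with made-up 2-dimensional vectors, whereas the real model hashes n-grams into buckets.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of '<word>' with lengths minn..maxn."""
    token = f"<{word}>"  # boundary markers, so prefixes and suffixes are distinct
    return [token[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(token) - n + 1)]

def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the n-grams that have a known vector (simplification)."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

# Hypothetical n-gram table, for illustration only.
toy_table = {"<bu": [1.0, 0.0], "ug>": [0.0, 1.0]}
print(char_ngrams("bug"))
print(oov_vector("bug", toy_table, dim=2))
```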

Of course, if the default dim (that is, the embedding size) above is too large, you can set it yourself:

$ fasttext skipgram -input text8 -dim 10  -output text8_ft10
$ head -n 3 text8_ft10.vec 
71290 10
the -0.69471 -0.35273 0.18617 -0.3283 0.28874 0.35978 -0.50711 -0.11573 -0.30905 -0.58648 
of -0.87699 -0.46422 0.10984 -0.15627 0.49961 0.22101 -0.40932 -0.24884 -0.20546 -0.54027 


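Dataset: the DBpedia ontology classification dataset. It has 14 classes; for each class, 40k samples are used for training and 5k for testing, so the training set has 560k samples in total and the test set has 70k.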

cat classes.txt 


1,"Bergan Mercy Medical Center"," Bergan Mercy Medical Center is a hospital located in Omaha Nebraska. It is part of the Alegent Health System."
1,"The Unsigned Guide"," The Unsigned Guide is an online contacts directory and careers guide for the UK music industry. Founded in 2003 and first published as a printed directory The Unsigned Guide became an online only resource in November 2011."
...
14,"The Blithedale Romance"," The Blithedale Romance (1852) is Nathaniel Hawthorne's third major romance. In Hawthorne (1879) Henry James called it the lightest the brightest the liveliest of Hawthorne's unhumorous fictions."
14,"Razadarit Ayedawbon"," Razadarit Ayedawbon (Burmese: ရာဇာဓိရာဇ် အရေးတော်ပုံ) is a Burmese chronicle covering the history of Ramanya from 1287 to 1421. The chronicle consists of accounts of court intrigues rebellions diplomatic missions wars etc. About half of the chronicle is devoted to the reign of King Razadarit (r."
14,"The Vinyl Cafe Notebooks"," Vinyl Cafe Notebooks: a collection of essays from The Vinyl Cafe (2010) is Stuart McLean's ninth book and each one has been a Canadian bestseller. McLean has sold over 1 million books in Canada. Unlike the other Vinyl Cafe books these are not Dave and Morley stories.Selected from 15 years of radio-show archives and re-edited by the author this eclectic essay collection provides a glimpse into the thoughtful mind at work behind The Vinyl Cafe."
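Each row above is a CSV record of the form class index, title, abstract. A small Python sketch of reading such rows and checking the split sizes follows. The two sample rows are copied from the listing above (the second abstract is shortened here), and the variable names are illustrative.

```python
import csv
import io

sample = '''1,"Bergan Mercy Medical Center"," Bergan Mercy Medical Center is a hospital located in Omaha Nebraska. It is part of the Alegent Health System."
14,"The Blithedale Romance"," The Blithedale Romance (1852) is Nathaniel Hawthorne's third major romance."
'''

# Each record: class index (1..14), title, abstract.
rows = [(int(label), title, text.strip())
        for label, title, text in csv.reader(io.StringIO(sample))]

n_classes = 14
train_total = n_classes * 40_000  # 40k training samples per class
test_total = n_classes * 5_000    # 5k test samples per class

print(rows[0][:2], train_total, test_total)
```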


# shuffle the input lines
myshuf() {
  perl -MList::Util=shuffle -e 'print shuffle(<>);' "$@";
}

# lowercase, add the __label__ prefix, separate punctuation, squeeze spaces, shuffle
normalize_text() {
  tr '[:upper:]' '[:lower:]' | sed -e 's/^/__label__/g' | \
    sed -e "s/'/ ' /g" -e 's/"//g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' \
        -e 's/,/ , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
        -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' | tr -s " " | myshuf
}


1,"TY KU"," TY KU /taɪkuː/ is an American alcoholic beverage company that specializes in sake and other spirits. The privately-held company was founded in 2004 and is headquartered in New York City New York. While based in New York TY KU's beverages are made in Japan through a joint venture with two sake breweries. Since 2011 TY KU's growth has extended its products into all 50 states."
1,"Odd Lot Entertainment"," OddLot Entertainment founded in 2001 by longtime producers Gigi Pritzker and Deborah Del Prete (The Wedding Planner) is a film production and financing company based in Culver City California.OddLot produced the film version of Orson Scott Card's sci-fi novel Ender's Game. A film version of this novel had been in the works in one form or another for more than a decade by the time of its release."
# after processing
__label__1 , ty ku , ty ku /taɪkuː/ is an american alcoholic beverage company that specializes in sake and other spirits . the privately-held company was founded in 2004 and is headquartered in new york city new york . while based in new york ty ku ' s beverages are made in japan through a joint venture with two sake breweries . since 2011 ty ku ' s growth has extended its products into all 50 states . 
__label__1 , odd lot entertainment , oddlot entertainment founded in 2001 by longtime producers gigi pritzker and deborah del prete ( the wedding planner ) is a film production and financing company based in culver city california . oddlot produced the film version of orson scott card ' s sci-fi novel ender ' s game . a film version of this novel had been in the works in one form or another for more than a decade by the time of its release . 
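For readers who find the tr/sed pipeline hard to follow, here is a rough Python equivalent of the per-line normalization. It is a sketch that mirrors the rules of normalize_text in the same order and leaves out the shuffling step; the `<br />` rule is included on the assumption that it matches fastText's classification-example.sh.

```python
import re

def normalize_text(line):
    """Lowercase, prefix with __label__, space out punctuation, squeeze spaces."""
    s = "__label__" + line.lower()
    s = s.replace("'", " ' ").replace('"', "")
    s = s.replace(".", " . ")
    s = s.replace("<br />", " ")  # drop HTML line breaks (assumed rule, see lead-in)
    s = s.replace(",", " , ").replace("(", " ( ").replace(")", " ) ")
    s = s.replace("!", " ! ").replace("?", " ? ")
    s = s.replace(";", " ").replace(":", " ")
    return re.sub(r" +", " ", s)  # same effect as: tr -s " "

# Shortened, made-up variant of the first sample row.
print(normalize_text('1,"TY KU"," TY KU\'s growth (2011)."'))
```

Because the class index is the first CSV field, prefixing each line with __label__ turns it into the label token that fasttext supervised expects.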


fasttext supervised -input dbpedia.train -output trainout -dim 10 -lr 0.1 -wordNgrams 2 -minCount 1 -bucket 10000000 -epoch 5 -thread 4


$ head -n 6 trainout.vec 
802981 10
the 0.48158 0.13413 -0.5119 0.62694 0.089501 -0.024228 -0.13503 0.23139 0.041772 0.081158 
. -0.61252 -0.32307 0.78123 -0.56232 -0.0014737 -0.019952 0.22725 0.065144 -0.23527 -0.053442 
, -0.38554 -0.35668 0.071955 0.54615 -0.041367 -0.010555 -0.11941 0.3101 -0.077714 -0.35903 
in 0.159 -0.21333 0.048756 -0.058684 1.0204 0.54013 1.2182 -0.02415 -0.004165 0.6187 
of -0.078618 -0.11361 -0.32771 0.63844 -0.79154 0.32892 -0.55461 -0.47428 -0.6273 0.51869 


$ fasttext test trainout.bin dbpedia.test 
N	70000
P@1	0.985
R@1	0.985
$ fasttext predict trainout.bin dbpedia.test >dbpedia.test.predict
$ head dbpedia.test.predict 
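P@1 and R@1 are identical in the test output above because every DBpedia sample carries exactly one label and one prediction is made per sample, so both reduce to plain accuracy. A small Python sketch of the two definitions, using made-up labels rather than the real prediction file:

```python
def precision_recall_at_1(gold, predicted):
    """gold: list of sets of true labels; predicted: list of top-1 predicted labels."""
    hits = sum(1 for g, p in zip(gold, predicted) if p in g)
    precision = hits / len(predicted)          # correct / number of predictions made
    recall = hits / sum(len(g) for g in gold)  # correct / number of true labels
    return precision, recall

# Hypothetical toy data: one true label per sample, as in DBpedia.
gold = [{"__label__1"}, {"__label__14"}, {"__label__3"}, {"__label__3"}]
pred = ["__label__1", "__label__14", "__label__3", "__label__7"]
print(precision_recall_at_1(gold, pred))
```

With the real files, the gold labels could be taken from the first token of each line of dbpedia.test and the predictions from dbpedia.test.predict, assuming one label per line in both.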



$ head enwik9
Wikipedia 1.6alphafirst-letterMediaSpecial

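Preprocessing is required. The preprocessing script is a program that filters a Wikipedia XML dump into "clean" text consisting only of lowercase letters (a-z, converted from A-Z) and spaces (never consecutive). All other characters are converted to spaces. Only text that normally appears in a web browser is kept. Tables are removed. Image captions are preserved. Links are converted to plain text. Digits are spelled out.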

perl enwik9 > file9


fasttext skipgram -input file9 -output file9out -lr 0.025 -dim 10 -ws 5 -epoch 3 -minCount 5 -neg 5 -loss ns -bucket 2000000 -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100


  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]
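To give a feel for what -bucket controls: n-grams are not stored by name but hashed into a fixed number of rows, so a larger bucket count means fewer collisions and a larger model. The Python sketch below uses the standard 32-bit FNV-1a hash. It is my understanding that fastText uses an FNV-1a variant and places n-gram rows after the word rows, but treat both details, and the helper names, as assumptions rather than the library's exact behavior (for non-ASCII bytes in particular the real hash may differ).

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash."""
    h = 2166136261
    for b in data:
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

def ngram_row(ngram: str, n_words: int, bucket: int) -> int:
    """Row index for an n-gram: rows 0..n_words-1 are assumed to hold whole words."""
    return n_words + fnv1a_32(ngram.encode("utf-8")) % bucket

# Vocabulary size from the text8 run, bucket count from the enwik9 command.
for gram in ["<bu", "bug", "ug>"]:
    print(gram, ngram_row(gram, n_words=71290, bucket=2_000_000))
```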





曾经的爱 曾经的爱我自己来回答我可能是真的喜欢她但是,每次不知道为什么看到她就一中伤心的感觉不知道谁能告诉我这...
小公鸡不见偷偷的跟在小鸭子后面... 小公鸡不见偷偷的跟在小鸭子后面也下了水偷偷地这个词加上这个词的好处是小公鸡不见偷偷的跟在小鸭子后面也...
开紫钻一年多少钱? 开紫钻一年多少钱?我卖180 ,一年大概120 玩旋舞特适合用财富通开8.8折一共是211
关于韩剧 关于韩剧灰姑娘的姐姐吧,不错。还有一枝梅。狗与狼的时间,都好看。
谁小泰罗的电影? 谁小泰罗的电影?小泰罗奥特曼的电影? 你说的这个是奥特曼剧场版《奥特物语》网上搜一下就能找到
电纸书和mp4功能一样吗,买电... 电纸书和mp4功能一样吗,买电子书好还是mp4好啊?推荐一款吧?谢谢看你具体干什么了,如果喜欢看电子...
关于ff14的装备出售问题! ... 关于ff14的装备出售问题! 为什么好多装备无法在市场出售?紫装绿装不说,为什么普通白装也不能关于f...
为什么佛法的争论那么激烈? 为什么佛法的争论那么激烈?因为他们没有一个统一的经典,来指导他们,告诉他们真理. 不然他们就不会拜那...
为什么有的调皮美女不喜欢调皮男... 为什么有的调皮美女不喜欢调皮男人,不稳重没有安全感。她喜欢成熟有男人味安全感男人?大部分女人都喜欢成...
哪些男明星男人味儿彰显,成熟的... 哪些男明星男人味儿彰显,成熟的魅力撩动心弦让你沦陷?朱一龙,张若昀,李钟硕,李承铉,靳东等,这些男明...
天书奇谈里皇城里的盘龙石柱在哪... 天书奇谈里皇城里的盘龙石柱在哪阿、?点左上角的【寻】会出来一个框,里面有蟠龙石柱,点击左键会自动寻找...
50年前,长沙镖子岭。老烟头把... 50年前,长沙镖子岭。老烟头把他的旱烟在地上敲了敲:“下不下去喃?”独眼的小伙子说:“不去,电视剧5...
如何评价演员袁姗姗,她是不是真... 如何评价演员袁姗姗,她是不是真的不适合娱乐圈?因人而异吧,我个人还是比较喜欢她的,虽然长得确实不漂亮...
各路大神? 各路大神?既然可以发送、接受文件,可以确认电脑蓝牙驱动正常,可以连接手机蓝牙。接下来要确认手机有蓝牙...
三国演义诸葛亮的资料 三国演义诸葛亮的资料三国演义诸葛亮的资料去看史书就知道了诸葛亮子孔明,号卧龙,南阳人,著作出师表,隆...
如何隐藏一个硬盘分区,只能我自... 如何隐藏一个硬盘分区,只能我自己能进,别人都看不到。用patition magic 隐藏分区
韩剧名字其中两个字是爱情 韩剧名字其中两个字是爱情没关系,是爱情啊--赵寅成,孔孝真最佳爱情--车胜元,孔孝真爱情能用钱买吗-...
谁炒股一年能挣百万 谁炒股一年能挣百万规定时间 数量 你以为在算加减法 能有正解答案 股市是有时长时短 时多时少 这是没...
一个你不爱的人 一个你不爱的人面拒绝 呵呵,女的都出来了,这问题到是比较符合你们!其他的怎么女的回答的那么少??
有人有曲的成语 有人有曲的成语曲尽人情【读音】qū jìn rén qíng【释义】委婉周到地把人之常情或世态充分体...