Clustering Sogou Chinese news with word2vec
(1) Download the Sogou data: http://www.sogou.com/labs/sogoudownload/SogouCA/news_tensite_xml.full.zip
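(2) Strip the markup: convert the dump from GBK to UTF-8 and keep only the <content> lines, which hold the article text (the <content> tags themselves remain on each line). cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt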
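(3) Word segmentation: the text can be segmented with the Java library ANSJ. The commands below assume the segmented output is saved as resultbig.txt.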
(4) Train the word vectors (skip-gram with hierarchical softmax, 200 dimensions): ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
(5) Compute distances (the tool prompts for a word and lists its nearest neighbours by cosine distance): ./distance vectors.bin
(6) Clustering (K-means over the word vectors, 500 classes): ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500
sort classes.txt -k 2 -n > classes.sorted.txt
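
The grep in step (2) selects the article lines but leaves the wrapping tags in place. If the tags should go as well, a minimal sketch could look like the following. It assumes each article body sits on a single line wrapped in <content>...</content>; the function name extract_content is hypothetical and not part of any tool.

```shell
# Hypothetical helper: keep only <content> lines and strip the wrapping tags.
# Reads UTF-8 text on stdin, writes one article body per line on stdout.
extract_content() {
    grep '<content>' | sed -e 's/<content>//' -e 's/<\/content>//'
}

# Possible usage with the file names from step (2):
# iconv -f gbk -t utf-8 -c < news_tensite_xml.dat | extract_content > corpus.txt
```

Articles with an empty body come out as blank lines, so a further filter may be wanted before segmentation.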
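
The sorted file from step (6) has one "word cluster_id" pair per line, which is hard to read cluster by cluster. A small sketch for collapsing it into one line per cluster is shown below. It assumes the input is already sorted by cluster id; the function name group_clusters and the output file clusters.txt are hypothetical.

```shell
# Hypothetical helper: turn "word cluster_id" lines into "cluster_id: word word ...".
# Input on stdin must be sorted by cluster id, as classes.sorted.txt is.
group_clusters() {
    awk '
        # Start a new output line whenever the cluster id changes.
        NR == 1 || $2 != prev {
            if (NR > 1) printf "\n"
            printf "%s:", $2
            prev = $2
        }
        # Append the current word to the line of its cluster.
        { printf " %s", $1 }
        END { if (NR > 0) printf "\n" }
    '
}

# Possible usage:
# group_clusters < classes.sorted.txt > clusters.txt
```

Each output line then lists every word assigned to one cluster, which makes it easier to eyeball whether the 500 classes are coherent.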