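This is a word-frequency analysis program. It takes text files as command-line arguments and prints the total number of English words and the 10 most frequent words. Several files are passed at once, and the program has to analyze each file individually as well as all of the files combined.
I managed, however clumsily, to print the word count and the top 10 words for each individual file, but I am stuck on printing the top 10 words across all files combined. On top of that, the frequencies have to be displayed visually, and I have no idea how to approach that either. I doubt I have explained this well in words, so here is a sample of the expected output.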
$ python textanalyzer.py -i thesis.txt lol.txt loro.txt
### FILE: thesis.txt
32758 Words
the 1637 ( 4.996%) ==============================
i 511 ( 1.557%) ==========
a 327 ( 0.998%) ======
is 102 ( 0.311%) ==
my 100 ( 0.305%) ==
cool 50 ( 0.152%) =
python 32 ( 0.098%) =
linux 31 ( 0.098%) =
some 28 ( 0.085%) =
image 21 ( 0.064%) =
### FILE: lol.txt
1000 Words
lol 1000 (100.000%) ==============================
### FILE: loro.txt
1000 Words
loro 600 ( 60.000%) ==============================
orol 400 ( 40.000%) ====================
### TOTAL
34758 Words
the 1637 ( 4.710%) ==============================
lol 1000 ( 2.877%) ==================
loro 600 ( 1.726%) ============
i 511 ( 1.470%) ========
orol 400 ( 1.151%) =======
a 327 ( 0.941%) =====
is 102 ( 0.293%) ==
my 100 ( 0.288%) ==
cool 50 ( 0.144%) =
python 32 ( 0.092%) =
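The bar column in the sample looks like it is scaled against the most frequent word of each section: the top word gets 30 `=` characters, and `orol` (400 against a top count of 600) gets 20. A minimal sketch of that scaling follows. The names `make_bar`, `format_row`, and `BAR_WIDTH` are hypothetical, and both the rounding rule (rounding up, so that any nonzero count shows at least one `=`) and the percentage field width of 7 are assumptions; the sample rows do not pin them down.

```python
BAR_WIDTH = 30  # assumed: the longest bar in the sample is 30 characters


def make_bar(count, top_count, width=BAR_WIDTH):
    """Return a bar of '=' scaled so that top_count maps to `width` characters."""
    if count <= 0 or top_count <= 0:
        return ""
    # Integer ceiling division. Rounding up is an assumption, not taken from the sample.
    length = (count * width + top_count - 1) // top_count
    return "=" * length


def format_row(word, count, total, top_count):
    """Format one row: word, count, share of all words in percent, and the bar."""
    percent = 100.0 * count / total
    # Width 7 is assumed so that "100.000" fits and smaller values are right-aligned.
    return "%s %d (%7.3f%%) %s" % (word, count, percent, make_bar(count, top_count))


if __name__ == "__main__":
    # Counts taken from the loro.txt section of the sample output.
    print(format_row("loro", 600, 1000, 600))
    print(format_row("orol", 400, 1000, 600))
```

Here `top_count` is the count of the first word in the sorted list for that section, so the same function would serve both the per-file tables and the TOTAL table.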
Here is my code.
#!/usr/bin/env python3
import sys

argvs = sys.argv
argc = len(argvs)

if argc < 2:
    print('Usage: # python %s filename' % argvs[0])
    sys.exit()
elif argc == 2:
    print('\n ### FILE: %s\n' % argvs[1])
    f = open(argvs[1])
    data = f.read()
    f.close()
    # counting of 1st file
    words = {}
    i = 0
    for word in data.split():
        words[word] = words.get(word, 0) + 1
        i = i + len(word.split())
    print(' %d Words\n' % i)
    # sort by count of 1st file
    d = [(v, k) for k, v in words.items()]
    d.sort()
    d.reverse()
    for count, word in d[:10]:
        print(" %s \t:%d (%.3f%% ) " % (word, count, (count / i) * 100))
elif argc == 3:
    print('\n ### FILE: %s\n' % argvs[1])
    f = open(argvs[1])
    data = f.read()
    f.close()
    # counting of 1st file
    words = {}
    i = 0
    for word in data.split():
        words[word] = words.get(word, 0) + 1
        i = i + len(word.split())
    print(' %d Words\n' % i)
    # sort by count of 1st file
    d = [(v, k) for k, v in words.items()]
    d.sort()
    d.reverse()
    for count, word in d[:10]:
        print(" %s\b\t:%d (%.3f%% ) " % (word, count, (count / i) * 100))

    print('\n ### FILE: %s\n' % argvs[2])
    f = open(argvs[2])
    data = f.read()
    f.close()
    # counting of 2nd file (note: this overwrites the 1st file's counts)
    words = {}
    j = 0
    for word in data.split():
        words[word] = words.get(word, 0) + 1
        j = j + len(word.split())
    print(' %d Words\n' % j)
    # sort by count of 2nd file
    d = [(v, k) for k, v in words.items()]
    d.sort()
    d.reverse()
    for count, word in d[:10]:
        print(" %s\b\t:%d (%.3f%% ) " % (word, count, (count / j) * 100))

    # total: only the combined word count so far, no combined top 10 yet
    print('\n ### Total \n')
    print(' %d Words\n' % (i + j))
else:
    print('Usage: # python %s filename filename' % argvs[0])
    sys.exit()
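For the combined top 10, one possible direction is to keep each file's counts instead of overwriting `words`, and to add them into a running total. The sketch below does this with `collections.Counter` and loops over any number of files rather than branching on `argc`. It is an illustration under assumptions, not a finished solution: the function names (`count_text`, `count_file`, `total_counts`, `report`, `main`) are hypothetical, the `-i` option is inferred from the sample invocation, and lowercasing the words and reading the files as UTF-8 are both assumptions.

```python
import argparse
from collections import Counter


def count_text(text):
    """Count whitespace-separated words. Lowercasing is an assumption."""
    return Counter(text.lower().split())


def count_file(path):
    """Count the words of one file. UTF-8 encoding is an assumption."""
    with open(path, encoding="utf-8") as f:
        return count_text(f.read())


def total_counts(per_file_counts):
    """Add up several Counters into one combined Counter."""
    total = Counter()
    for counts in per_file_counts:
        total.update(counts)  # update() adds counts; it does not replace them
    return total


def report(title, counts, top_n=10):
    """Print the word total and the top_n most frequent words."""
    total = sum(counts.values())
    print(title)
    print("%d Words" % total)
    for word, count in counts.most_common(top_n):
        print("%s %d (%7.3f%%)" % (word, count, 100.0 * count / total))


def main(argv=None):
    parser = argparse.ArgumentParser(description="Word frequency analysis")
    # "-i" followed by one or more file names, as in the sample invocation.
    parser.add_argument("-i", nargs="+", default=[], metavar="FILE", dest="files")
    args = parser.parse_args(argv)
    if not args.files:
        parser.print_usage()
        return
    per_file = []
    for path in args.files:
        counts = count_file(path)
        report("### FILE: %s" % path, counts)
        per_file.append(counts)
    report("### TOTAL", total_counts(per_file))


if __name__ == "__main__":
    main()
```

Because the percentages in the TOTAL section are computed against the summed word count, they come out smaller than the per-file values for the same word, which matches the sample (`the` is 4.996% in thesis.txt but 4.710% overall).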