ERRor ANnotation Toolkit: Automatically extract and classify grammatical errors in parallel original and corrected sentences.
First, the learner corpus. When it comes to learner corpora that come with native-speaker corrections, there is NICER: the Nagoya Interlanguage Corpus of English Reborn, so this time I will try using NICER.
*JPN513: education of teachers
%NTV: Education of Teachers
%COM:
%par:
*JPN513: Let's think about education of teachers in Japan.
%NTV: Let's think about the education of teachers in Japan.
%COM:
%par:
*JPN513: Firstly, in Japan how do people become teachers?
%NTV: First, in Japan, how do people become teachers?
%COM: The "-ly" ending after numbers is old-fashioned. Just say "First", "Second", etc.
*JPN513: I know two ways.
%NTV: OK
%COM:
*JPN513: One is to enter school for growing teachers.
%NTV: One is to enter a teacher's college.
%COM:
*JPN513: The other is to study education besides the major.
%NTV: The other is to study education in addition to another major.
%COM:
*JPN513: Near here Aichi education college are there.
%NTV: NG
%COM:
(rest omitted)
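The tier layout above can be read into one record per learner sentence. The following is only a minimal sketch: it assumes each tier sits on its own line and that the prefix is followed by a tab or spaces. The names `TIER` and `parse_tiers` are hypothetical helpers, not part of NICER or ERRANT.

```python
import re

# Hypothetical helper, not part of NICER or ERRANT.
# Matches a learner tier ("*JPN513:") or a dependent tier ("%NTV:", "%COM:", "%par:").
TIER = re.compile(r'^(\*[A-Z]+\d+|%[A-Za-z]+):[ \t]*(.*)$')

def parse_tiers(text):
    """Group tier lines into one record per learner sentence."""
    records = []
    for line in text.splitlines():
        m = TIER.match(line)
        if not m:
            continue  # skip anything that is not a tier line
        tier, content = m.group(1), m.group(2).strip()
        if tier.startswith('*'):
            # A learner line opens a new record.
            records.append({'speaker': tier[1:], 'learner': content})
        elif records:
            # Dependent tiers attach to the most recent learner line.
            records[-1][tier[1:].lower()] = content
    return records

# Assumption: the real files separate prefix and text with a tab.
sample = "\n".join([
    "*JPN513:\tFirstly, in Japan how do people become teachers?",
    "%NTV:\tFirst, in Japan, how do people become teachers?",
    '%COM:\tThe "-ly" ending after numbers is old-fashioned.',
    "*JPN513:\tI know two ways.",
    "%NTV:\tOK",
    "%COM:",
])
records = parse_tiers(sample)
for r in records:
    print(r['learner'], '->', r.get('ntv'))
```

Keeping the learner sentence and its `%NTV` tier in the same record makes it harder for the two sides to drift out of alignment than when each side is collected separately.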
*JPN513: Let's think about education of teachers in Japan.
*JPN513: Firstly, in Japan how do people become teachers?

%NTV: Let's think about the education of teachers in Japan.
%NTV: First, in Japan, how do people become teachers?
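#!/usr/bin/python
# -*- coding: utf-8 -*-
import glob
import os
import re
import sys

# Usage: python3 <program name> <input directory>
for filename in sorted(glob.glob(os.path.join(sys.argv[1], '*.txt'))):
    # Name the output files after the input file, minus its directory and extension.
    # (str.strip() removes a set of characters, not a prefix or suffix, so it is avoided here.)
    base = os.path.splitext(os.path.basename(filename))[0]
    with open(filename, 'r') as f:
        text = f.read()
    # Learner lines start with "*"; native-speaker corrections start with "%NTV".
    originals = re.findall(r'(\*.+)', text, re.IGNORECASE)
    corrections = re.findall(r'(%NTV.+)', text, re.IGNORECASE)
    # The output files are written to the current directory.
    with open(base + '_ori.txt', 'w') as fw_ori:
        fw_ori.write('\n'.join(originals))
    with open(base + '_cor.txt', 'w') as fw_cor:
        fw_cor.write('\n'.join(corrections))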
Learner
*JPN513: education of teachers
*JPN513: Let's think about education of teachers in Japan.
*JPN513: Firstly, in Japan how do people become teachers?
*JPN513: I know two ways.
*JPN513: One is to enter school for growing teachers.
*JPN513: The other is to study education besides the major.
*JPN513: Near here Aichi education college are there.

Native-speaker corrections
%NTV: Education of Teachers
%NTV: Let's think about the education of teachers in Japan.
%NTV: First, in Japan, how do people become teachers?
%NTV: OK
%NTV: One is to enter a teacher's college.
%NTV: The other is to study education in addition to another major.
%NTV: NG
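#!/usr/bin/python
# -*- coding: utf-8 -*-
import glob
import os
import re
import sys

# Usage: python3 extraction.py <output directory> <input directory>
out_dir, in_dir = sys.argv[1], sys.argv[2]
os.makedirs(out_dir, exist_ok=True)

for path in sorted(glob.glob(os.path.join(in_dir, '*.txt'))):
    # Name the output files after the input file, minus its extension.
    base = os.path.splitext(os.path.basename(path))[0]
    with open(path, 'r') as f_in, \
            open(os.path.join(out_dir, base + '_ori.txt'), 'w') as fw_ori, \
            open(os.path.join(out_dir, base + '_cor.txt'), 'w') as fw_cor:
        result_ori = ''
        for line in f_in:
            # Learner tier, written to match the "*JPN513:" prefix in the sample above;
            # only the sentence itself is kept.
            original = re.match(r'\*JPN\d+:[ \t]*(\S.*)', line, re.IGNORECASE)
            if original:
                result_ori = original.group(1).rstrip() + '\n'
                fw_ori.write(result_ori)
            # Native-speaker tier. "OK" means no correction was needed,
            # so the learner sentence is written in its place.
            correct = re.match(r'%NTV:[ \t]*(\S.*)', line, re.IGNORECASE)
            if correct:
                corrected = correct.group(1).rstrip() + '\n'
                fw_cor.write(result_ori if corrected.strip() == 'OK' else corrected)

To use it, run the following in Terminal on a Mac, or in something like Cygwin on Windows: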
python3 <program name> <output directory> <input directory>

So if the program above is saved as extraction.py, the output directory is named "result", and NICER is stored in "NICER", the command is:
python3 extraction.py result/ NICER/

This should run.
Learner
education of teachers
Let's think about education of teachers in Japan.
Firstly, in Japan how do people become teachers?
I know two ways.
One is to enter school for growing teachers.
The other is to study education besides the major.
Near here Aichi education college are there.

Native-speaker corrections
Education of Teachers
Let's think about the education of teachers in Japan.
First, in Japan, how do people become teachers?
I know two ways.
One is to enter a teacher's college.
The other is to study education in addition to another major.
NG

With the first program, the native-speaker correction for "I know two ways." was just "OK". With the new program, it is properly replaced by the correct sentence the learner wrote.
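The "OK" handling described above can be isolated as a small function and checked on its own. This is a sketch under assumptions, not the program itself: `resolve_ok` is a hypothetical name, and it takes the two sides as already extracted lists of sentences. It also refuses input where the two sides differ in length, since parallel original and corrected sentences need to line up one to one. "NG" is left untouched, as in the output above.

```python
def resolve_ok(originals, corrections):
    """Return corrections with every bare "OK" replaced by the learner sentence.

    Hypothetical helper for illustration; both arguments are lists of sentences.
    """
    if len(originals) != len(corrections):
        raise ValueError("not line-aligned: %d original vs %d corrected lines"
                         % (len(originals), len(corrections)))
    return [ori if cor.strip() == 'OK' else cor
            for ori, cor in zip(originals, corrections)]

originals = [
    "I know two ways.",
    "One is to enter school for growing teachers.",
    "Near here Aichi education college are there.",
]
corrections = [
    "OK",
    "One is to enter a teacher's college.",
    "NG",
]
resolved = resolve_ok(originals, corrections)
for ori, cor in zip(originals, resolved):
    print(ori, '|', cor)
```

Comparing the stripped line to "OK" exactly, rather than searching for the substring, avoids rewriting a real correction that merely contains the letters "OK".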