Abstract:
Online news generates increasingly large volumes of information as unstructured text. Because only humans can readily
interpret the meaning of such text, this growth causes information overload and hinders the effective use of
the knowledge embedded in it. Therefore, Automatic Knowledge Extraction (AKE) has become an integral part of
the Semantic Web and Natural Language Processing (NLP). Although recent literature shows that AKE has progressed, its results
still fall short of expectations. This study proposes a method to automatically extract surface knowledge from English news into a
machine-interpretable semantic format (triples). The proposed technique is based on the grammatical structure of
sentences, from which 11 original extraction rules were derived. The initial experiment extracted triples from a Sri Lankan news corpus, of
which 83.5% were meaningful. The experiment was then extended to the British Broadcasting Corporation (BBC) news dataset to
test how well the method generalizes, and it yielded a higher meaningful-triple extraction rate of 92.6%. These results were validated
using inter-rater agreement, which indicated high reliability.
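
To make the idea of grammar-based triple extraction concrete, the following minimal Python sketch applies a single pattern (noun phrase, verb group, noun phrase) to a sentence that has already been part-of-speech tagged. It is an illustration only: the pattern is a hypothetical example and is not one of the 11 rules of the proposed method, and the tag names, the hand-tagged sentence, and the functions `read_phrase` and `extract_triple` are all assumptions introduced here.

```python
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]          # (word, coarse part-of-speech tag)
Triple = Tuple[str, str, str]    # (subject, predicate, object)

# Assumed coarse tags that may appear inside a simple noun phrase.
NOUN_TAGS: Set[str] = {"DET", "ADJ", "NOUN", "PROPN"}
VERB_TAGS: Set[str] = {"AUX", "VERB"}


def read_phrase(tokens: List[Token], start: int, tags: Set[str]) -> Tuple[str, int]:
    """Collect consecutive tokens whose tag is in `tags`, beginning at `start`.

    Returns the joined words and the index of the first token not consumed.
    """
    end = start
    while end < len(tokens) and tokens[end][1] in tags:
        end += 1
    words = " ".join(word for word, _ in tokens[start:end])
    return words, end


def extract_triple(tokens: List[Token]) -> Optional[Triple]:
    """Apply one hypothetical rule: noun phrase + verb group + noun phrase."""
    subject, i = read_phrase(tokens, 0, NOUN_TAGS)
    predicate, j = read_phrase(tokens, i, VERB_TAGS)
    obj, _ = read_phrase(tokens, j, NOUN_TAGS)
    if subject and predicate and obj:
        return (subject, predicate, obj)
    return None  # the sentence does not match this rule


# Hand-tagged example sentence (illustrative, not taken from either corpus).
sentence: List[Token] = [
    ("The", "DET"), ("central", "ADJ"), ("bank", "NOUN"),
    ("raised", "VERB"),
    ("interest", "NOUN"), ("rates", "NOUN"),
    (".", "PUNCT"),
]

triple = extract_triple(sentence)
print(triple)
```

A practical system would obtain the tags from a tagger or parser and would need many more rules to cover clauses, passives, and prepositional objects; sentences that match no rule simply yield no triple here.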