Natural Language Processing
Natural Language Processing is a branch of artificial intelligence which deals with the analysis and creation of texts in a natural language (such as Polish or English).
Neurosoft has been conducting research in this field for over 10 years. Our work concentrates primarily on methods for the grammatical analysis of texts written in Polish. In January 2000 this work resulted in the release of NeuroGram, a commercial system for the morphosyntactic analysis of Polish texts that is unique worldwide.
The main advantages of Neurosoft’s technology are:
- a complete dictionary of base forms (almost 135,000 words) with a highly accurate grammatical classification of each word (part of speech, grammatical categories, relationships between words, etc.),
- a fully automatic module that generates all inflected forms of every base form (more than 2,100,000 forms are generated from the dictionary mentioned above), taking into account all the irregularities of the Polish language,
- high efficiency of all the algorithms, which is essential in operations where a great amount of text is to be processed (e.g. a website enabling access to millions of documents),
- flexibility: the technology runs on various operating systems (Linux, UNIX, MS Windows Server).
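The idea of generating inflected forms from a dictionary entry while handling irregularities can be illustrated with a minimal sketch. It is not Neurosoft's implementation: the single paradigm (singular forms of feminine nouns ending in -a), the exception table and the function name `inflect` are assumptions made purely for illustration.

```python
# Toy paradigm (assumed for illustration): singular case endings
# for Polish feminine nouns ending in -a with a hard stem.
FEM_A_ENDINGS = {
    "nom": "a", "gen": "y", "dat": "ie", "acc": "ę",
    "ins": "ą", "loc": "ie", "voc": "o",
}

# Hypothetical exception table: forms the regular paradigm would get wrong.
IRREGULAR = {
    "ręka": {"gen": "ręki", "dat": "ręce", "loc": "ręce"},
}

def inflect(lemma):
    """Return a dict mapping case name -> singular form of the lemma."""
    if not lemma.endswith("a"):
        raise ValueError("this toy paradigm only covers nouns ending in -a")
    stem = lemma[:-1]
    # Regular forms first: stem + ending for every case.
    forms = {case: stem + ending for case, ending in FEM_A_ENDINGS.items()}
    # Then overwrite the cases listed as irregular for this lemma.
    forms.update(IRREGULAR.get(lemma, {}))
    return forms

print(inflect("mapa"))
print(inflect("ręka"))
```

A production system would need many paradigms and a far richer description of stem alternations; the point here is only the split between a regular rule and a per-word list of exceptions.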
In its current version, Gram is used, among other things, to automatically build indexes for full-text search engines dedicated to the Polish language. Because Polish is highly inflected, indexing base forms instead of every inflected form significantly reduces the number of stored keywords and radically improves the efficiency and precision of the search.
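Why base forms shrink the index can be shown with a small sketch. The form-to-lemma table below is a tiny hand-written stand-in for a real morphological dictionary, and the names `lemmatize`, `build_index` and `search` are hypothetical, not part of any Neurosoft API.

```python
from collections import defaultdict

# Hypothetical form -> base form table (a real dictionary would hold millions of forms).
FORM_TO_LEMMA = {
    "kot": "kot", "kota": "kot", "kotem": "kot", "koty": "kot",
    "pies": "pies", "psa": "pies", "psem": "pies", "psy": "pies",
}

def lemmatize(token):
    """Map an inflected form to its base form; unknown words are kept as-is."""
    token = token.lower()
    return FORM_TO_LEMMA.get(token, token)

def build_index(docs):
    """Build an inverted index keyed by base form instead of surface form."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[lemmatize(token.strip(".,;:!?"))].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing any form of the query word."""
    return index.get(lemmatize(query), set())

DOCS = {1: "Koty śpią.", 2: "Widzę psa i kota.", 3: "Spacer z psem."}
INDEX = build_index(DOCS)

print(search(INDEX, "kot"))
print(search(INDEX, "psy"))
```

All forms of one word share a single index key, so a query in any inflected form finds documents that contain a different form of the same word.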
We are currently working on grammatical analysis: surface parsing, dependency parsing and methods combining the two approaches are all being developed. This work has resulted in new functionality in NeuroGram 3 that automatically breaks a sentence's structure down into its basic components. This functionality will make it significantly easier to generate summaries and to prepare automatic translations from or into Polish.
This technology will significantly improve the quality of solutions for speech synthesis and analysis. The improved grammatical analysis will also support text standardization and correction algorithms. It should enable the detection of grammatical errors (e.g. inflection errors) and improve the quality of the generated base forms, for example when correcting “Polishlike” texts (written in Polish, but without Polish diacritics).
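The correction of “Polishlike” text can be sketched as a dictionary lookup on forms with the Polish letters stripped. This is only an illustration under stated assumptions: the word list is a tiny hypothetical sample, and the names `strip_diacritics` and `restore` are invented for the example.

```python
# Translation table from Polish letters to their plain Latin counterparts.
STRIP = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")

# Hypothetical sample of dictionary words.
WORDS = ["żółw", "łąka", "laska", "łaska", "kot"]

def strip_diacritics(word):
    """Return the word as it would be typed without Polish letters."""
    return word.translate(STRIP)

def build_candidates(words):
    """Map each stripped spelling to every dictionary word that produces it."""
    table = {}
    for word in words:
        table.setdefault(strip_diacritics(word), []).append(word)
    return table

CANDIDATES = build_candidates(WORDS)

def restore(word):
    """Return candidate spellings with Polish letters; unknown words are returned unchanged."""
    return CANDIDATES.get(word.lower(), [word])

print(restore("zolw"))
print(restore("laska"))
```

The second call returns two candidates ("laska" and "łaska"), which is exactly the kind of ambiguity that a lookup alone cannot resolve and where grammatical analysis of the surrounding sentence can help.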
Another extension will be a mechanism for detecting and interpreting regular expressions, both at the level of individual words (sequences of letters) and at the level of sentences (sequences of words). This mechanism will cooperate with the parsers and will be able to refer to their results. The ability to define their own regular expressions and interpretation methods will allow users to implement a number of natural language processing tasks themselves, such as the analysis of natural-language queries addressed to different types of databases.
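A regular expression over words rather than letters can be approximated with the standard `re` module by encoding each word's grammatical tag as a single character. The sketch below assumes hand-assigned part-of-speech tags (in practice they would come from a parser); the tag set, the one-letter codes and the function `find_phrases` are all hypothetical.

```python
import re

# Assumed one-letter codes for a small hypothetical tag set.
TAG_CODES = {"ADJ": "A", "NOUN": "N", "VERB": "V", "PREP": "P"}

def find_phrases(tagged, pattern):
    """Match a regex against the tag sequence and return the matched word groups."""
    # One character per token, so match offsets are also token indices.
    codes = "".join(TAG_CODES.get(tag, "?") for _, tag in tagged)
    return [
        [word for word, _ in tagged[m.start():m.end()]]
        for m in re.finditer(pattern, codes)
    ]

# "mały czarny kot widzi psa": tags assigned by hand for illustration.
SENTENCE = [
    ("mały", "ADJ"), ("czarny", "ADJ"), ("kot", "NOUN"),
    ("widzi", "VERB"), ("psa", "NOUN"),
]

# Any number of adjectives followed by a noun.
NOUN_PHRASE = r"A*N"

print(find_phrases(SENTENCE, NOUN_PHRASE))
```

Because every token maps to exactly one character, the full regular expression syntax (repetition, alternation, grouping) becomes available at the sentence level without writing a custom matcher.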
We are also considering the use of dictionaries of idioms and phrases which, like grammatical analysis, should improve the performance of all high-level algorithms in Gram. Such dictionaries are particularly useful for resolving ambiguity when base forms are generated from inflected forms. We also plan to integrate our knowledge of the relations between words with the semantic relations derived from WordNet.
With the new grammatical analysis implemented in Gram, it also becomes possible to analyze natural-language queries.