AUTOMATIC CLASSIFICATION OF PUNJABI DOCUMENTS USING VECTOR SPACE MODEL by Harsimran Pal Kaur
A vast amount of useful data is now generated on the web, in the form of journals, e-documents and web pages. Large numbers of documents are created in every field, and classification is necessary to manage and handle them. Automatic classification of documents is an approach that automatically assigns predefined category labels to input text documents. Punjabi is an Indo-Aryan regional language spoken by 102 million people. For regional languages, only a limited number of classifiers are available, and they rely on limited resources. The Punjabi Document Classification System receives an unlabelled Punjabi document and assigns it a predefined class label. Automatic classification of Punjabi documents using the Vector Space Model is performed by preprocessing the documents and converting them into a set of vectors. The document vectors are then used to compute cosine similarity, the measure that determines the class label of the input document. The proposed system has been compared with existing systems: those using Naive Bayes and Centroid-based techniques give 64% and 71% accuracy respectively, and those developed using Hybrid and Ontology-based techniques both give 85% accuracy. These existing systems use 180 documents of different categories for training and testing. The proposed system has been validated using 517 documents containing thousands of words and gives 86.84% accuracy.
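The vector-based classification step described above can be illustrated with a minimal sketch. It is not the system's actual implementation: it assumes raw term-frequency weights, whitespace tokenization in place of the Punjabi preprocessing, and one summed vector per class, none of which the abstract specifies. The function names (`to_vector`, `build_class_vectors`, `classify`) and the toy English training data are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def to_vector(text):
    """Turn a document into a sparse term-frequency vector (term -> count)."""
    # Assumption: whitespace splitting stands in for the real preprocessing.
    return Counter(text.split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def build_class_vectors(labelled_docs):
    """Sum the term counts of all training documents that share a label."""
    class_vectors = {}
    for label, text in labelled_docs:
        class_vectors.setdefault(label, Counter()).update(to_vector(text))
    return class_vectors


def classify(document, class_vectors):
    """Return the label whose class vector is most similar to the document."""
    doc_vec = to_vector(document)
    return max(
        class_vectors,
        key=lambda label: cosine_similarity(doc_vec, class_vectors[label]),
    )


# Made-up toy data; a real run would use preprocessed Punjabi documents.
training = [
    ("sports", "match team player goal"),
    ("sports", "team wins match"),
    ("health", "doctor hospital medicine"),
    ("health", "medicine patient doctor"),
]
class_vectors = build_class_vectors(training)
label = classify("player scores goal in match", class_vectors)
print(label)  # prints: sports
```

In this sketch the input document shares the terms "player", "goal" and "match" with the sports vector and no terms with the health vector, so its cosine similarity to sports is positive and to health is zero. A fuller implementation might swap in TF-IDF weights or compare against individual training documents instead of summed class vectors.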