A Framework for Efficient Text Analysis in Wings

We have developed a framework that assists scientists with computational experiments for text analysis. It takes advantage of the unique ability of Wings to reason about constraints and encompasses more than 50 components for machine learning and text analysis tasks. The framework also contains workflows for text classification, text clustering, and the visualization of results, and it can be used with several of the most common datasets from the text classification research community.

The following diagram shows a text classification workflow that can be executed in Wings. A scientist can use this workflow to run experiments and test different algorithms and parameters in order to achieve the best results for their dataset. The workflow starts with two files that contain the training set and the test set. Both datasets first go through preprocessing steps that remove stop words and small words, respectively. In the next step, a Stemmer is applied in order to remove morphological variations from the datasets. The Stemmer is shown in grey because it is an abstract component that can be specialized with different concrete components (e.g., Porter Stemmer). The Term Weighting component creates a word vector space model of each dataset, while an additional Feature Selection step is applied to the training set. The Feature Selection keeps only the best features with respect to the Feature Score, which is computed by the Correlation Score component (e.g., Mutual Information, Information Gain, Chi Square). The Modeler then trains a Model on the reduced training set produced by the Feature Selection. Next, the Classifier computes predictions on the test set. The Modeler and the Classifier use implementations (e.g., kNN, Decision Tree, Support Vector Machines, Naive Bayes) from open-source machine learning frameworks. In the last step, a component called Evaluate computes result metrics based on the predictions for the test dataset.
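The steps above can be sketched outside of Wings as well. The following is a minimal illustration using scikit-learn, where the comments name the corresponding workflow components; the toy documents, labels, and parameter values are our own assumptions, and the abstract Stemmer step is omitted for brevity.

```python
# Illustrative sketch of the workflow's main steps; data and parameters
# are made up for this example, not taken from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy training and test sets (two classes: course vs. faculty pages).
train_docs = ["lecture schedule homework exam", "professor research publications",
              "syllabus lecture exam grading", "faculty research grant publications"]
train_labels = [0, 1, 0, 1]
test_docs = ["homework exam schedule", "research publications grant"]
test_labels = [0, 1]

# Term Weighting: build a TF-IDF word vector space model; stop words are
# removed here (the Stemmer step is omitted in this sketch).
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Feature Score + Feature Selection: keep the k best features according
# to a correlation score, here chi-square.
selector = SelectKBest(chi2, k=5)
X_train = selector.fit_transform(X_train, train_labels)
X_test = selector.transform(X_test)

# Modeler: train a model on the reduced training set.
model = MultinomialNB().fit(X_train, train_labels)

# Classifier: compute predictions on the test set.
predictions = model.predict(X_test)

# Evaluate: compute result metrics from the test-set predictions.
print(accuracy_score(test_labels, predictions))
```

In the sketch a single estimator plays both the Modeler and the Classifier role, whereas in Wings these are two separate components that must be specialized consistently.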

Wings validates the workflow and rejects invalid combinations, such as using different machine learning algorithms in the Classifier and the Modeler. The next diagram shows some results from the workflow. It was executed with the common WebKB dataset (http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/) and shows results for alternative specializations of the workflow with different Term Weighting, Correlation Score, and learning algorithm components:
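Comparing such alternative specializations can be sketched as a loop over candidate learning algorithms. The following scikit-learn example is illustrative only; the documents, labels, and parameter settings are our own assumptions, not results from the paper.

```python
# Illustrative comparison of alternative learning algorithms, analogous to
# running the workflow with different specializations; all data here is toy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

train_docs = ["lecture schedule homework exam", "professor research publications",
              "syllabus lecture exam grading", "faculty research grant publications"]
train_labels = [0, 1, 0, 1]
test_docs = ["homework exam schedule", "research publications grant"]
test_labels = [0, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# In Wings, the Modeler and the Classifier are separate components and the
# workflow is rejected if they use different algorithms; with a single
# scikit-learn estimator this constraint holds by construction.
results = {}
for name, algo in [("Naive Bayes", MultinomialNB()),
                   ("kNN", KNeighborsClassifier(n_neighbors=1)),
                   ("Decision Tree", DecisionTreeClassifier(random_state=0)),
                   ("SVM", LinearSVC())]:
    model = algo.fit(X_train, train_labels)
    results[name] = accuracy_score(test_labels, model.predict(X_test))
    print(f"{name}: {results[name]:.2f}")
```

Each iteration corresponds to one specialization of the abstract Modeler/Classifier pair; Wings additionally lets the scientist vary the Term Weighting and Correlation Score components in the same fashion.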