Active Learning - Literature Review
What is Active Learning? Why is it important?
Active learning is a type of semi-supervised machine learning in which the learning algorithm runs on unlabelled data and "asks questions": it interactively queries a human in the loop for annotations and uses the answers to obtain the desired outputs at new data points.
Many modern machine learning techniques require large amounts of labelled training data to achieve good results, and manually annotating all of that data is often not feasible. Active learning instead operates on unlabelled data, tries to learn the task, and informs the human which labels would be most useful at the current state of the model. The aim is to ease the data collection process by automatically deciding which instances an annotator should label, so that the algorithm is trained as quickly and effectively as possible.
In our case, we have e-mail data that are unlabelled and need to be classified. Active learning is therefore a suitable semi-supervised approach, because it helps the human in the loop annotate the data points of interest.
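The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses uncertainty sampling (one common query strategy, not necessarily the one used in any of the systems reviewed here), a tiny hand-written naive Bayes scorer, made-up e-mail texts, and a lookup table (HIDDEN_LABELS) that stands in for the human annotator. All names are hypothetical.

```python
import math
from collections import Counter


def train(labelled):
    """Fit a tiny naive Bayes model: word counts and document counts per class."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = {0: 0, 1: 0}
    for text, label in labelled:
        word_counts[label].update(text.lower().split())
        doc_counts[label] += 1
    return word_counts, doc_counts


def prob_relevant(model, text):
    """Estimate P(label == 1 | text) using add-one smoothing."""
    word_counts, doc_counts = model
    vocab_size = len(set(word_counts[0]) | set(word_counts[1]))
    totals = {c: sum(word_counts[c].values()) for c in (0, 1)}
    log_odds = math.log((doc_counts[1] + 1) / (doc_counts[0] + 1))
    for word in text.lower().split():
        p1 = (word_counts[1][word] + 1) / (totals[1] + vocab_size + 1)
        p0 = (word_counts[0][word] + 1) / (totals[0] + vocab_size + 1)
        log_odds += math.log(p1 / p0)
    return 1.0 / (1.0 + math.exp(-log_odds))


def most_uncertain(model, pool):
    """Uncertainty sampling: pick the text whose probability is closest to 0.5."""
    return min(pool, key=lambda text: abs(prob_relevant(model, text) - 0.5))


def active_learning_loop(labelled, pool, ask_annotator, budget):
    """Query the annotator `budget` times (budget must not exceed len(pool))."""
    labelled, pool = list(labelled), list(pool)  # work on copies
    for _ in range(budget):
        model = train(labelled)
        query = most_uncertain(model, pool)
        pool.remove(query)
        labelled.append((query, ask_annotator(query)))
    return train(labelled), labelled, pool


# Toy data (assumed for illustration): 1 = work-related, 0 = not work-related.
SEED = [("meeting agenda for monday", 1), ("win a free prize now", 0)]
POOL = [
    "free meeting room available now",
    "project agenda and budget",
    "claim your free prize",
    "lunch on monday",
]
# Stand-in for the human in the loop: the labels a person would give.
HIDDEN_LABELS = {
    "free meeting room available now": 1,
    "project agenda and budget": 1,
    "claim your free prize": 0,
    "lunch on monday": 1,
}

if __name__ == "__main__":
    model, labelled, remaining = active_learning_loop(
        SEED, POOL, HIDDEN_LABELS.get, budget=2
    )
    for text, label in labelled[len(SEED):]:
        print(f"asked annotator about: {text!r} -> {label}")
    print("still unlabelled:", remaining)
```

In a real setting, ask_annotator would present the message to a person instead of reading from a table, and the scorer would be replaced by whatever classifier is actually being trained; the loop structure stays the same.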
What is Active Learning in Visual Analytics? What are the benefits & limitations?
Active Learning (AL) in Visual Analytics (VA) is a special type of incremental supervised machine learning (ML) in which users are involved in the learning process (user-in-the-loop) to guide training and analysis. It can also be regarded as semi-supervised machine learning. In AL, the user continually queries and annotates/labels data, and in doing so improves the quality of the learning model.
Benefits: Active learning is useful when large portions of the data to be analysed are unlabelled, when manual labelling is expensive, and when streaming data (e-mail or social media data) is monitored live, so that new unlabelled data needs to be processed continuously.
Drawbacks: In active learning, users are not involved in identifying and selecting the instances to label; they are involved only in the labelling itself.
In 1994, Shneiderman proposed the filter-flow method as a means of applying filters to structured multidimensional data in order to create user-defined dynamic queries based on Boolean logic. This dynamic query approach lets users explore a database rapidly, which helps them quickly discover the regions of a multidimensional search space that interest them (for example, clusters, gaps and outliers).
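The idea of building a query out of Boolean combinations of attribute filters can be sketched as follows. This is a minimal illustration, not Shneiderman's implementation: the helper names (in_range, equals, all_of, any_of, run_query) and the message records are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Filter = Callable[[Record], bool]


def in_range(attribute: str, low: float, high: float) -> Filter:
    """Filter that keeps records whose attribute lies in [low, high]."""
    return lambda record: low <= record[attribute] <= high


def equals(attribute: str, value: Any) -> Filter:
    """Filter that keeps records whose attribute equals the given value."""
    return lambda record: record[attribute] == value


def all_of(*filters: Filter) -> Filter:
    """Boolean AND: a record must pass every filter."""
    return lambda record: all(f(record) for f in filters)


def any_of(*filters: Filter) -> Filter:
    """Boolean OR: a record must pass at least one filter."""
    return lambda record: any(f(record) for f in filters)


def run_query(records: List[Record], flt: Filter) -> List[Record]:
    """Apply the composed filter to every record."""
    return [record for record in records if flt(record)]


# Hypothetical multidimensional data: one record per message.
MESSAGES = [
    {"id": 1, "size_kb": 12, "hour": 9, "has_attachment": False},
    {"id": 2, "size_kb": 850, "hour": 23, "has_attachment": True},
    {"id": 3, "size_kb": 40, "hour": 14, "has_attachment": True},
    {"id": 4, "size_kb": 5, "hour": 2, "has_attachment": False},
]

# size in [10, 100] AND (has an attachment OR sent between 8 and 10 o'clock)
QUERY = all_of(
    in_range("size_kb", 10, 100),
    any_of(equals("has_attachment", True), in_range("hour", 8, 10)),
)

if __name__ == "__main__":
    print([m["id"] for m in run_query(MESSAGES, QUERY)])
```

Because each filter is a small independent function, changing one range and re-running the query is cheap, which is what makes the interaction feel dynamic: narrowing the size range to [10, 30], for instance, would immediately shrink the result set.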
In 2004, Wong et al. developed IN-SPIRE, in which users can interact with the data, run search queries and select interesting or uninteresting subsets of the data; the remaining documents are then re-clustered and re-projected so that they can be seen in more detail. Until recently, one weakness was that new documents could not be added to an existing dataset. Users can now update a dataset or merge two datasets, and the processing takes advantage of the previously computed document analysis.
In 2008, Elmqvist et al. proposed DataMeadow, an approach similar to Shneiderman's for multivariate (multidimensional) data. It offers rich interaction for constructing visual queries using graphical set representations, allowing users to build queries by iteratively selecting, filtering and creating sets and subsets of the multidimensional data.
In 2008, Dörk et al. developed VisGets, a set of coordinated views that help users formulate queries which simultaneously combine temporal and spatial data filters for web-based search interfaces. The interactive query visualisation supports visual information seeking and exploratory search. The main goal was to facilitate the construction of dynamic search queries that combine filters from more than one data dimension. Users can integrate metadata-based filters, i.e., spatiotemporal restrictions and keyword queries, directly from the multiple coordinated view environment.
The approaches above combine human knowledge with machine learning techniques, but do not provide effective support for labelling. The following approaches keep the human in the loop and add effective labelling techniques: users either label the training data for the learning algorithm, or they gain insight into and control over the model creation process.
In 2010, Seifert et al. proposed an approach that lets users label training data for document classification effectively. It starts from an unsupervised clustering, which helps users mark regions of interest whose documents can be labelled identically. In this way the data can be used for training, then classified and visualised.
In 2011, Settles developed DUALIST, an active learning annotation paradigm that solicits and learns from labels on both instances and features, so users can annotate the features directly. The semi-supervised training algorithm is fast enough to support real-time interaction, with accuracy comparable to pre-existing methods.
In 2012, Moehrmann and Heidemann developed an advanced user interface and introduced an active learning approach to facilitate the labelling of large image data sets, since annotating such data sets manually takes a lot of time and effort. The integration of overview+detail concepts allows precise navigation inside large data sets.
Also in 2012, Höferlin et al. developed inter-active learning, which extends active learning by integrating the users' expertise: users can pose their own queries for data instances to label, annotate manually or automatically, and adjust complex classifier models. This helps to detect and correct inconsistencies between the classifier model trained from examples and the user's mental model of the class definition.
In 2012, Heimerl et al. compared three approaches to interactive classifier training (a basic method, a visual method and a user-driven method) in a user study. The approaches incorporate active learning to various degrees in order to reduce the labelling effort and to increase effectiveness by monitoring the quality of the classifier in iterative feedback loops. They can help analysts formulate search queries and define filter criteria that match their requirements using machine learning.
In 2013, Bosch et al. developed ScatterBlogs2, which applies combinations of filters, including classification methods from the field of machine learning, to dynamic unstructured data streams (live streaming). Similar to Dörk et al.'s approach, it provides an interface (a training environment) for creating advanced filters, in which users train and configure classifiers by labelling both relevant and irrelevant messages during live monitoring. In short, it supports user-guided filtering and classification.