# Natural Language Processing: MPCS53113 Ensuring Validity Using Box Plots

## Ensuring Validity Using Box Plots

In Section 2.8 we discussed implicit bias, which occurs when a model picks up on unrelated cues in a data set. This causes the model to make predictions that look accurate but are actually based on incorrect generalizations. This is called over-fitting. The basic idea is that the predictions of the model look accurate on a specific test set, but only because the model has been selected to fit that test set. In other words, if we had evaluated the model on a different set of examples, the prediction accuracy could be significantly lower.

Over-fitting is especially a problem when we are working with smaller data sets, like those that have been labeled by hand. If we only have a few hundred examples in the test set or if one of the classes is especially rare, then our performance is being evaluated on just a few examples. Let’s say that we hand-label news articles by topic and there is a topic CORRUPTION that has only five examples. We put three examples in the training data and keep two examples in the testing data. This means that our classifier is being evaluated on just two examples of this class.

There are two techniques for dealing with this problem: cross-validation and validation sets. For a shallow classifier like logistic regression, which holds all the data in memory at once, we use cross-validation. The basic idea here is that we train and test many times, on different parts of the data. If we repeat our process five times, it is called 5-fold cross-validation; if we repeat the process ten times, it is called 10-fold cross-validation. In each case, we rotate which data is used for training and which is used for testing until every sample has been used in the testing set once and only once. Thus, 10-fold cross-validation uses a 90/10 training/testing split in each fold. It is important to realize that cross-validation does not provide a single classifier, because we have actually trained and tested many different classifiers. But it does provide a robust understanding of the classifier's expected performance.
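The rotation of training and testing data can be sketched in a few lines of Python. This is a minimal illustration rather than the text's own implementation: the function name `k_fold_splits`, the sample count of 100, and the fixed random seed are all assumptions made for the example.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: every sample index lands in exactly one test fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training data is everything outside the current test fold.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# 10-fold cross-validation over 100 samples (assumed size).
for fold, (train_idx, test_idx) in enumerate(k_fold_splits(100, 10), start=1):
    # In practice: fit a fresh classifier on train_idx and score it on test_idx.
    print(f"fold {fold}: {len(train_idx)} training / {len(test_idx)} testing samples")
```

Each pass through the loop would train and score a separate classifier, which is why the result is a set of scores, one per fold, rather than a single model.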

## Unmasking Pseudonymous Authors Using Line Plots

Sometimes we need to visualize more than one model over time. For example, we have seen that we can use a text classifier to identify different authors using function word n-grams. When we work with books written by nineteenth-century authors, this model performs very well. But how robust is the classifier? How deep are these individual stylistic differences? Let's say that we have two writers, A and B. Writer A never starts a sentence with *And*. However, Writer B does so frequently. So, every text by Writer A has zero sentences starting with *And*, but Writer B has hundreds of sentences starting with *And*. We might have a classifier with perfect accuracy, but only because this one feature distinguishes between the two writers. We would not consider this model to be very meaningful: There is a lot more to stylistics than just this one feature.

To measure robustness, we use a technique called unmasking (Koppel, Schler, & Bonchek-Dokow, 2007), shown in Table 31. We train a logistic regression classifier to identify each author in the corpus. That means each feature in our vector, each function word n-gram, is assigned a weight between -1 and 1. We can use the feature weights to find out what the most important features are. Unmasking works like this: We train and test the classifier many times. But, each time, we remove the most predictive features, one for each author. By the end of the unmasking process we have many different f-scores, each based on fewer predictive features.
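The unmasking loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the cited paper or the text's own code: it uses only two authors (labeled 0 and 1), a small hand-written logistic regression, and synthetic data in which features 0 and 1 are clean markers and the rest are noise. The helper names `train_logreg`, `f1`, `unmask`, and `make_sample` are hypothetical.

```python
import math
import random

def train_logreg(X, y, active, epochs=200, lr=0.5):
    """Fit binary logistic regression by batch gradient descent on the active features."""
    w = {j: 0.0 for j in active}
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = {j: 0.0 for j in active}
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(w[j] * xi[j] for j in active)
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j in active:
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in active:
            w[j] -= lr * grad_w[j] / n
        b -= lr * grad_b / n
    return w, b

def f1(y_true, y_pred, positive=1):
    """F-score for one class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def unmask(X_train, y_train, X_test, y_test, rounds):
    """Retrain repeatedly, dropping the top feature for each author after every round."""
    active = set(range(len(X_train[0])))
    scores, removed = [], []
    for _ in range(rounds):
        if len(active) < 2:
            break
        w, b = train_logreg(X_train, y_train, sorted(active))
        preds = [1 if b + sum(w[j] * x[j] for j in active) > 0 else 0 for x in X_test]
        scores.append(f1(y_test, preds))
        # Most positive weight favours author 1; most negative favours author 0.
        drop = {max(w, key=w.get), min(w, key=w.get)}
        removed.append(sorted(drop))
        active -= drop
    return scores, removed

def make_sample(author, rng):
    # Synthetic data (assumed): features 0 and 1 mark one author each, 2-5 are noise.
    return [float(author), float(1 - author)] + [rng.random() for _ in range(4)]

rng = random.Random(0)
y_train = [i % 2 for i in range(40)]
y_test = [i % 2 for i in range(20)]
X_train = [make_sample(a, rng) for a in y_train]
X_test = [make_sample(a, rng) for a in y_test]

scores, removed = unmask(X_train, y_train, X_test, y_test, rounds=3)
for round_no, (score, dropped) in enumerate(zip(scores, removed), start=1):
    print(f"round {round_no}: f-score {score:.2f}, then removed features {dropped}")
```

The list of f-scores, one per round, is the series that would be drawn as a line for each model: a classifier that depends on one or two surface features should lose accuracy quickly once those features are removed, while deeper stylistic differences should degrade more slowly.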


