Modeling

Baseline regression models

Following EDA, we wanted to determine, in a statistically rigorous way, whether the aggregated Twitter sentiment for each candidate is correlated with his/her popularity as measured by polls (what we refer to as ground truth above). To do this, we performed a basic regression analysis for each candidate to assess the statistical significance of Twitter sentiment as a predictor of ground truth. First, we fitted an OLS model by regressing the ground-truth weekly time series on the aggregated mean Twitter sentiment for the week before each poll date. The p-value reported by Statsmodels' summary method was very high for all candidates, indicating no evidence against the null hypothesis that the coefficient of the sentiment predictor is zero. In other words, the aggregated sentiment was not statistically significant. We repeated the same analysis by regressing the ground-truth weekly percentage change on the weekly sentiment percentage change; the p-values were also very high for this experiment.
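
For reference, here is a minimal sketch of this per-candidate regression. The variable names and the synthetic placeholder data are ours; in the actual pipeline, the two series are the ground-truth poll values and the mean sentiment for the week preceding each poll date, aligned on the same weekly index.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Placeholder weekly series for one candidate (synthetic, for illustration).
rng = np.random.default_rng(0)
sentiment = pd.Series(rng.normal(0.55, 0.05, 30), name="mean_sentiment")
polls = pd.Series(rng.normal(0.25, 0.02, 30), name="ground_truth")

X = sm.add_constant(sentiment)   # add an intercept column
fit = sm.OLS(polls, X).fit()
print(fit.summary())             # inspect the p-value of the sentiment coefficient
```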

In summary, these experiments indicated that the aggregated Twitter sentiment did not provide significant signal to predict the outcome of the next poll.

More advanced models

The next step after this preliminary regression analysis was to develop more advanced models that could make use of any possible signal present in the Twitter sentiment. But before we could begin the modeling process, we first had to specify the precise goal of any potential model. As explained in the introduction, the primary objective of this project is to determine the extent to which political tweets reflect the popularity of the candidates. Therefore, to help answer this question, we sought to develop a model that could use previous polls along with relevant tweets to predict future polling numbers (on the order of a few days up to a week in advance). If tweets do in fact shed light on the popularity of the candidates, we’d expect this information to produce estimates of the ground truth polling numbers that are more accurate than those formed using previous polls alone. On the other hand, if the tweets are completely unrelated to the underlying popularity of the candidates, we’d expect this data to worsen the predictions, as the model would be misinterpreting this noise as signal.

Kalman filter

Our initial approach was to fit a separate Kalman filter (state-space model) for each candidate to predict his/her latent true popularity by modeling it as a linear dynamical system. The state of the system would be two-dimensional: the first dimension would be the true proportion of the electorate that would vote for a specific candidate (i.e., popularity), while the second dimension would be the true change in Twitter sentiment for that candidate from one time step to the next. In short, our idea was that the current popularity of each candidate should equal his/her popularity at the previous time step, plus or minus the effect of the change in Twitter sentiment for this candidate. To make the idea clearer, we could draw the following analogy: the first dimension of the state is analogous to the position of a moving object at any given time, while the second dimension, the true sentiment change, is analogous to the object's velocity. In other words, the current position of the object (the candidate's true popularity) would depend on its previous position and its previous velocity (sentiment change).

Both the true sentiment change and the true popularity of each candidate are latent variables that we don't directly observe. Instead, we only have noisy observations of them: the polls (the ground-truth time series that we constructed) and the sentiment change that we calculate from the sample of tweets we obtain each day.
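
For concreteness, here is a minimal sketch of the transition and observation matrices such a constant-velocity-style model would imply. This is purely our illustration; as discussed next, we could not justify these dynamics and abandoned the approach.

```python
import numpy as np

# State vector: [popularity, sentiment_change].
# Transition (position/velocity analogy): the new popularity equals the old
# popularity plus the previous sentiment change; the sentiment change itself
# is assumed to persist like a random walk.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])

# Observation: both state dimensions are observed noisily -- popularity via
# the polls and sentiment change via the daily tweet sample.
H = np.eye(2)
```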

However, unlike a moving object's position over time, which can be accurately described by kinematic equations, there was no clear relationship between the Twitter sentiment change and the candidates' popularity in the polls; as the regression analysis above indicated, neither the aggregated sentiment nor the sentiment change is correlated with the ground truth. For this reason, there was no principled way to write down the dynamics of the system in terms of the actual transition and observation matrices used in the Kalman filter equations. Consequently, we decided not to proceed with this approach.

Bayesian model

An alternative way to approach this prediction task is from a Bayesian perspective. In this problem, we start with the latest polls and then combine this information with recent tweet counts to predict future polling numbers. In this context, the latest polls represent our prior belief about the popularity of the candidates, while the tweet counts represent the data that we use to update this prior into a posterior belief about the candidates' popularity. To create a Bayesian model, we first had to set a prior distribution on the polling numbers of each candidate and specify the distribution from which the tweet counts are generated.

Because the support of all candidates in the race must add up to one, it was sensible to model the polling numbers as being Dirichlet distributed. (To handle the fact that the popularity of the top five candidates doesn't sum to one, we scaled the support of each candidate proportionally so that their cumulative support added up to one.) An added benefit of this Bayesian approach is that it allowed us to model the popularity of all the candidates jointly rather than having to model each candidate separately. In addition to simplifying the modeling process, this better reflects reality, since the model automatically accounts for the fact that the candidates' levels of support are negatively correlated with one another.

In terms of the data-generating process of the tweets, we envisioned the number of positive tweets about each candidate as being generated at a rate that matches the true proportion of the Democratic electorate who support this candidate. In this model, each supporter of a given candidate independently decides (with very small probability) to post a positive tweet about their preferred candidate. To simplify matters, we disregarded negative tweets as well as slightly positive tweets with a sentiment score between 0.5 and 0.6. We didn't take negative tweets into account because there was no easy way to combine them with the positive tweets to create an overall polling score for each candidate (e.g., it didn't make sense to treat each negative tweet as a negative vote for the associated candidate, since doing so would underestimate the candidate's true support and could even lead to a candidate receiving negative votes, which makes no sense). Another motivation for ignoring negative tweets was that it ensured we wouldn't inadvertently take into account the views of Republican Twitter users, who generally post negative tweets about the Democratic candidates but whose opinions have no influence on the Democratic primaries.
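
A toy simulation of this assumed process (all numbers below are illustrative, not our actual data) shows why the resulting positive-tweet counts end up approximately multinomial with probabilities proportional to the candidates' true support:

```python
import numpy as np

rng = np.random.default_rng(1)
support = np.array([0.30, 0.25, 0.20, 0.15, 0.10])  # true candidate support
supporters = (support * 1_000_000).astype(int)      # supporters per candidate
p_tweet = 0.001                                     # per-supporter probability of
                                                    # posting a positive tweet
counts = rng.binomial(supporters, p_tweet)          # observed positive tweets
print(counts / counts.sum())                        # roughly recovers `support`
```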

Our view of the data-generating process, coupled with the fact that we disregarded negative tweets, enabled us to take advantage of Dirichlet-multinomial conjugacy: we modeled each positive tweet as an independent categorical draw, so that the vector of positive-tweet counts follows a multinomial distribution with five categories, each with probability proportional to the true support of the corresponding candidate.

Formally, our goal was to model the vector \mathbf{\theta} = [\theta_1, ..., \theta_5], where \theta_i represents the true proportion of the Democratic electorate supporting candidate i. Our prior belief about \mathbf{\theta} is based on the latest available polls and is modeled as \mathbf{\theta} \sim Dir([\alpha_1, ..., \alpha_5]), where \alpha_i is proportional to the support of candidate i according to the latest polls. Meanwhile, the vector of positive tweet counts \mathbf{X} = [X_1, ..., X_5] is modeled as \mathbf{X} \mid \mathbf{\theta} \sim MultiNom(n, \mathbf{\theta}), where n = \sum_{i=1}^{5} X_i is the total number of positive tweets.

The posterior distribution (\mathbf{\theta} \mid \mathbf{X}) is extremely straightforward to compute in this case due to conjugacy: \mathbf{\theta} \mid \mathbf{X} \sim Dir([\alpha_1 + X_1, ..., \alpha_5 + X_5]). Moreover, each \alpha_i has the added interpretation of being a pseudocount of positive tweets for candidate i. This means that the degree to which the posterior is influenced by the prior and the data is proportional to \|\mathbf{\alpha}\|_1 and \|\mathbf{X}\|_1, respectively.
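
A minimal sketch of this conjugate update (the \alpha and X values below are illustrative placeholders, not our data):

```python
import numpy as np

alpha = np.array([30.0, 25.0, 20.0, 15.0, 10.0])  # prior pseudocounts, proportional
                                                  # to the latest poll numbers
X = np.array([400, 350, 300, 250, 200])           # positive-tweet counts

posterior = alpha + X                             # Dir(alpha) + counts -> Dir(alpha + X)
posterior_mean = posterior / posterior.sum()      # E[theta_i] = alpha'_i / sum_k alpha'_k
print(posterior_mean)
```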

Having specified the model, we first wanted to see how the posterior estimates of the candidates' polling numbers compare to their true polling numbers when the weight of the prior is equal to the weight of the data. The plots below compare the ground truth time series with the posterior mean time series for each candidate. Please note that the mean of each dimension of a Dirichlet-distributed random variable, \mathbf{\theta} \sim Dir([\alpha_1, ..., \alpha_5]), is given by \mathbf{E}(\theta_i) = \frac{\alpha_i}{\sum_{k=1}^{5} \alpha_k}.

A preliminary visual inspection shows that the posterior mean estimates of each candidate's support are not correlated with the ground truth polls.

The code used to generate the following plots can be found here.

Next, we sought to quantitatively measure the extent to which the number of positive tweets about each candidate is related to the candidate's true support. To do this, we introduced hyperparameters \beta_{\alpha} and \beta_{X} to scale the L1 norms of \mathbf{\alpha} and \mathbf{X}, respectively. The ratio of these hyperparameters controls the relative weights of the prior and the data, while their sum controls the shape of the posterior distribution, as higher values of \beta_{\alpha} + \beta_{X} produce sharper distributions with less variance around the mean. We then performed a two-dimensional grid search over these hyperparameters to identify the optimal values. The details of this grid search are presented below, followed by a code sketch of the procedure:

  • We split the time period for which we have polls (March through November) into intervals of six days. For each of these intervals, the goal was to predict the polling numbers five days in advance using the polling numbers on the first day (t_1) and Twitter data over the course of the next five days (t_2 through t_6).
  • Because our ground truth polling estimates were formed using all available polling data, it wasn’t appropriate to use our polling estimate on t_1 to form our prior since this estimate was influenced by future polls that had not yet been conducted. Therefore, for each day, we applied the same methodology to estimate the ground truth polling data using all polls that had been conducted up until the given day.
  • For each of the intervals, we set the \mathbf{\alpha} vector associated with the prior Dirichlet distribution to \frac{\beta_{\alpha}}{\sum_{i=1}^{5}S_{t_1,i}}[S_{t_1,1}, ..., S_{t_1,5}], where S_{t_1,i} represents the estimated support of candidate i on day t_1 according to the polls. This ensured that the L1 norm of \mathbf{\alpha} equaled \beta_{\alpha} and that the elements of \mathbf{\alpha} were proportional to the support of the respective candidates. Meanwhile, the data vector \mathbf{X} was simply scaled by \beta_{X}.
  • The ground truth polling vector on t_6 (\mathbf{S}_{t_6}) was normalized so that the popularity of the candidates summed to one.
  • The likelihood of the normalized \mathbf{S}_{t_6} vector was calculated under the posterior distribution \mathbf{\theta} \mid \mathbf{X}, \beta_{\alpha}, \beta_{X} for each of the intervals, and the average likelihood across all intervals was recorded.
  • The model with the values of \beta_{\alpha} and \beta_{X} that maximized the average likelihood computed above was considered to be the best model to predict future polling numbers.
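
The sketch below outlines this procedure. The helpers `prior_support`, `tweet_counts`, and `target`, the `intervals` list, and the grid values are all hypothetical stand-ins for the corresponding steps in the notebook.

```python
import numpy as np
from itertools import product
from scipy.stats import dirichlet

# Hypothetical helpers standing in for the notebook's actual steps:
#   prior_support(t1)    -> length-5 poll-based support using polls up to t1
#   tweet_counts(t2, t6) -> positive-tweet counts per candidate over t2..t6
#   target(t6)           -> normalized ground-truth support vector on day t6
#   intervals            -> list of (t1, t6) day pairs covering March-November

def avg_likelihood(beta_alpha, beta_x, intervals):
    likelihoods = []
    for t1, t6 in intervals:
        s = prior_support(t1)
        alpha = beta_alpha * s / s.sum()       # ensures ||alpha||_1 == beta_alpha
        x = beta_x * tweet_counts(t1 + 1, t6)  # scaled data vector (days t2..t6)
        posterior = alpha + x                  # conjugate Dirichlet update
        likelihoods.append(dirichlet.pdf(target(t6), posterior))
    return np.mean(likelihoods)

# Example search ranges (ours, not necessarily the notebook's actual grid).
beta_alpha_grid = np.logspace(2, 6, num=9)
beta_x_grid = np.logspace(-4, 1, num=11)

best = max(product(beta_alpha_grid, beta_x_grid),
           key=lambda g: avg_likelihood(g[0], g[1], intervals))
```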

The grid search described above revealed that the optimal values of \beta_{\alpha} and \beta_{X} were 110,000 and 0.003, respectively. Given that the average L1 norm of \mathbf{X} was roughly 23,000, this meant that the fraction of weight placed on the Twitter data was practically zero (\frac{23000 * 0.003}{23000 * 0.003 + 110000} \approx 0). In other words, given the latest available polls and the Twitter data, the model entirely ignored the Twitter data and exclusively used the latest polls to predict future polling numbers. Overall, this analysis indicated that the number of positive tweets about each candidate is unrelated to the candidate's true support.

The notebook that performs this grid search can be found here.