Our Code


Website:

https://almafaz.github.io/index.html#hero

Github Repository:

https://github.com/almafaz/almafaz.github.io

Group Contributions

Group members: Simone von Mehren (s190739), Natasha Norsker (s194270) and Alma Fazlagic (s194271)

All group members contributed equally, but the main contributors on the sub-sections were:

Explainer Notebook

Motivation

The Ukraine-Russia tweet data

In this project we worked with a large dataset of tweets discussing the ongoing war between Ukraine and Russia. The dataset forming the foundation of this project was originally retrieved from the GitHub repo ehsanulhaq1/russo_ukraine_dataset and consists of tweet ids for all tweets containing certain keywords relevant to the war in the period from 21st February 2022 to 15th April 2022. Using the tweet ids, we were able to download the corresponding tweets from Twitter along with a number of attributes including the user name, language, date of creation, etc. Twitter is used by 229 million users daily (www.aljazeera.com) and is thereby an extensive source if one wants insights into the public opinion on some matter. The original dataset collected by ehsanulhaq consists of 3 million+ tweets distributed over 54 days, and we subsampled one million tweets for the analysis and data visualisations in this project.

Some of the million reasons why this data is interesting

The Russian invasion of Ukraine has shaken many countries and has severely affected economies, families and political relationships across the world. As a consequence of Russia's war crimes, its threat to Ukraine's independence and its strained relationship with NATO, many countries have imposed sanctions on Russia, resulting in price increases and a massive decrease in the delivery of gas from Russia to the world. Most importantly, more than 14.000 Ukrainians have lost their lives and many more have become refugees in Europe.

All of these reasons make it interesting and highly relevant to analyse the public opinion and discourse towards Russia and the war in general - and Twitter is a great source for accessing this. We want to investigate whether there has been any change in the discourse and sentiment towards Ukraine and Russia during the period of the war, and whether there are any differences in political and public opinion from one country to another.

Furthermore, we will investigate how users interact in Ukrainian- and Russian-language tweets by performing a network analysis, looking at both clustering trends and community detection.

What's in it for you?

In terms of our goal for the end users' experience, we wanted to create an interactive webpage which invites the users to dig into the investigation themselves. Overall, we wanted to present an overview of what people are saying about the war on Twitter, which main subjects are discussed, and who people interact with within their spoken language.

We aimed at presenting our findings in a user-friendly and comprehensible way using a variety of data visualisations and interactive network graphs. The webpage was designed to put emphasis on the time/language aspects of the analysis, making it easy for the user to understand our results and focus on one area at a time. The interactive parts of the webpage invite the user to engage in the analysis themselves, and furthermore it was our hope that these features would make the experience fun and more appealing to the user.

Basic stats. Let's get to know our data.

The following sections of code include everything used for data preprocessing and cleaning. Due to the high memory usage and heavy computations, everything was run on the HPC cluster and is simply displayed here for documentation purposes.

Dataset

The original dataset includes tweet ids retrieved from Twitter using the Twitter streaming API and covers tweets from each day in the period 21-02-2022 to 15-04-2022. The dataset is available at the GitHub repo ehsanulhaq1/russo_ukraine_dataset, where the authors are also listed.

The tweet ids were collected based on a range of keywords relevant to the war, such as russia, ukraine, putin, zelensky, russian, ukrainian, keiv and kyiv. However, the list of keywords was continuously updated by the authors during the data collection, as more and more keywords became relevant to the crisis. For the complete list of keywords, we refer to the original GitHub repo (see link above). We reached out to the owner of the repo with the purpose of retrieving the specific keywords he used, but unfortunately he was not able to get back to us before the hand-in of the assignment. The Twitter streaming API tracked the requested keywords and provided every tweet that contained any of them.

Tweet ids

The first snippet of code below was used to download the tweet ids from the original GitHub repo. To ensure a uniform distribution of data over the whole period, we extracted 18.000 randomly chosen tweet ids per day, yielding a dataset of approximately 1 million tweets (whereas the original dataset included 3 million+ tweets). Reducing the size of the dataset was a necessity in this project, as using the entire original dataset would have been infeasible due to heavy computations and high memory usage.
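As an illustration, the daily subsampling could look like the sketch below. This is not the verbatim download script: the per-day file layout and the path names in the repo are assumptions.

```python
# Sketch of the daily id subsampling (file layout and paths are assumed).
import pandas as pd

N_PER_DAY = 18_000
sampled = []

for day in pd.date_range("2022-02-21", "2022-04-15"):
    path = f"russo_ukraine_dataset/{day:%Y-%m-%d}.txt"  # hypothetical file name
    # read ids as strings to avoid precision loss on 64-bit ids
    ids = pd.read_csv(path, header=None, names=["tweet_id"], dtype=str)
    # uniform random draw within the day, seeded for reproducibility
    sampled.append(ids.sample(n=min(N_PER_DAY, len(ids)), random_state=42))

pd.concat(sampled).to_csv("tweet_ids_sample.csv", index=False)
```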

We are, however, aware that our final dataset is only a subset of the original dataset and therefore only represents a subset of the discussion of the war on Twitter. To make our subset as representative as possible of the full dataset, we extracted the tweet ids for each day randomly (i.e. randomised on tweet upload time), as mentioned briefly above. Other factors contribute to this limitation as well: for example, the keywords used when collecting the tweet ids from Twitter control which tweets are included in and excluded from the dataset. Many tweets about the Ukraine-Russia war will obviously not include these keywords, so we must be aware of this limitation when drawing any conclusions based on our dataset.

Downloading the tweets

After retrieving the tweet ids, we used the Twitter API v2 to collect the corresponding tweets from Twitter and saved them all in a .csv-file.
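A minimal sketch of such a lookup with tweepy's v2 client is shown below; the bearer token, the batching and the selected fields are illustrative assumptions rather than our exact script.

```python
# Hedged sketch of a v2 tweet lookup with tweepy; token and fields are placeholders.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

def fetch_tweets(tweet_ids):
    """Look up tweets in batches of 100 (the v2 lookup maximum)."""
    rows = []
    for i in range(0, len(tweet_ids), 100):
        resp = client.get_tweets(
            ids=tweet_ids[i:i + 100],
            tweet_fields=["created_at", "lang", "entities", "referenced_tweets"],
            expansions=["author_id"],
        )
        rows.extend(resp.data or [])  # resp.data is None if nothing was found
    return rows
```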

Our dataset consists of 18.000 tweets per day distributed over 54 days, and it includes tweets from 651.363 users in total. As the majority of users do not allow tracking of geolocation, we were not able to save the countries/locations of users and tweets as first intended. Instead we tracked the language used in each tweet; in total we found 32 languages.

We saved the following 8 attributes for each tweet:

The full text and hashtags were saved to be used in our text and discourse analysis later on. We used the creation date and written language of the tweet to filter the tweets on time and language, respectively, when performing the text and network analysis, and the tweet id and parent author were used to generate several social network graphs during the analysis.

Cleaning the data

After extracting all tweets, we performed multiple steps of data cleaning before using the data for analysis. The following major steps were performed on all tweets in our data-cleaning process:

The code used for this process can be seen below.
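Since the full pipeline ran on the HPC, the snippet below is only a minimal sketch of the kind of steps involved. The regexes and the English-only stopword list are illustrative assumptions; the real pipeline also had to handle Ukrainian and Russian text.

```python
# Minimal sketch of typical tweet-cleaning steps; assumes the nltk 'punkt'
# and 'stopwords' data have been downloaded, and handles English text only.
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)      # strip URLs
    text = re.sub(r"[@#](\w+)", r"\1", text)  # keep mention/hashtag words, drop symbols
    text = re.sub(r"[^a-z\s]", " ", text)     # drop punctuation and digits
    tokens = word_tokenize(text)
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]
```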

Understanding the volume volatility of the data

We created a plot of the tweet volume per day for the English, Ukrainian and Russian languages respectively. This was done to get an overview of the volume for the different languages we were analysing. The actual volume graph was created in HTML, but the data for the graph was extracted in Python, and the extraction code can be seen below.
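A sketch of the extraction with pandas could look as follows (the column names and output format are assumptions):

```python
# Per-day tweet volumes for the three languages; 'lang' uses Twitter's
# language codes (en, uk, ru). Column names are assumed.
import pandas as pd

df = pd.read_csv("tweets_clean.csv", parse_dates=["created_at"])
sub = df[df["lang"].isin(["en", "uk", "ru"])]

volume = (
    sub.groupby([sub["created_at"].dt.date, "lang"])
       .size()
       .unstack(fill_value=0)   # one row per day, one column per language
)
volume.to_json("tweet_volume.json")  # consumed by the HTML plot
```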

Tweet Volume Plot

It is easy to see that the volume of tweets written in English far exceeds the volume written in Ukrainian and Russian respectively. This is only to be expected, as English has far more speakers overall. However, it means that English hashtag trends and keywords may dominate in those parts of the analysis where we do not split by language.

Tools, theory and analysis. From theory to insights.

Text analysis

Dispersion Plot of Important Words

Dispersion Plot

Analyses of dispersion plot

Our initial intention with the dispersion plot was to use it to locate events or trends in time regarding the war. However, after numerous trials with different words we had to conclude that the dispersion plot does not show significant differences in the magnitude of word usage.

We do, however, see a few interesting tendencies in the dispersion plot above. First of all, the use of the Ukrainian city name "Vasylkiv" peaks around 25th-28th February 2022, again around 12th March 2022 and lastly in the period 1st-4th April 2022. These periods correspond to the dates of the initial and second attacks on the city and the declaration of liberation from the Russian invaders, respectively (www.theguardian.com).

Furthermore, the word "glory" was used mostly by the Russians around February 28th 2022, a few days after the first attacks of the Russian invasion of Ukraine, which may imply that the Russians were rather optimistic at this point in time. The word "economy" was used significantly more by the Russians than by the other languages on March 8th 2022, approximately a week after the West started to impose economic sanctions on Russia(ns). It can also be seen that the expression "nuclear war" was used more by the Russians around April 1st. We did not find any particular major event on that exact date, but the debate over whether a nuclear war was a significant threat did rise during April 2022, so there may be some connection here.

Wordshift Graphs

The word shift graphs below compare the relative word frequencies and sentiment contributions of two groups of tweets, highlighting which words drive the difference in average sentiment between them.
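We do not reproduce the plotting code here, but as a hedged sketch, the quantity such a graph displays can be computed as below: each word contributes its sentiment relative to a reference value, weighted by the change in its relative frequency between the two corpora. The lexicon here is a toy placeholder; a real run would use full sentiment scores (e.g. labMT or VADER).

```python
# Sketch of the per-word contributions behind a sentiment word shift.
from collections import Counter

def word_shift(tokens_a, tokens_b, sentiment, ref=5.0):
    """Contribution of each word to the sentiment difference between corpora."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = sum(freq_a.values()), sum(freq_b.values())
    contributions = {}
    for word in set(freq_a) | set(freq_b):
        if word not in sentiment:
            continue
        dp = freq_b[word] / n_b - freq_a[word] / n_a   # change in relative frequency
        contributions[word] = (sentiment[word] - ref) * dp
    # largest absolute contributions first, as in the plotted graph
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)

# toy lexicon (placeholder scores on a 1-9 scale, neutral reference 5.0)
lexicon = {"glory": 7.3, "victory": 7.8, "bomb": 2.0, "killed": 1.6}
print(word_shift(["bomb", "killed", "glory"], ["glory", "victory"], lexicon))
```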

Russian vs Ukrainian Wordshift

Word Shift

Analysis of word shifts for Russian and Ukrainian

To clarify what can be seen in the plot: the light colors represent words that are less frequently used in Ukrainian tweets than in Russian tweets, and the dark colors represent words that are more frequently used in Ukrainian than in Russian.

Yellow indicates positive words, while blue indicates negative words.

As can be seen from the light yellow color, the word 'russia' is not as commonly used in tweets written in Ukrainian as it is in Russian tweets. It can also be seen that it is counted as a positive word. Since the intention of a tweet is not necessarily for 'russia' to be positive, this might skew the overall result of the wordshift, so that the Ukrainian tweets are counted more negatively than they perhaps should be.

To fix this issue, we would either need contextual information and assign 'russia' and 'russian' our own sentiment scores depending on the context, or we would have to remove the words completely. Both seemed to be subpar solutions: the decrease in the use of 'russia' and 'russian' is still a valuable insight, and creating a context-specific sentiment score for only these two words would make the wordshift unnecessarily complicated.

It can be seen that Ukrainian tweets more often include the word 'glory'. Ukraine has a war slogan, 'Slava Ukraini', which translates to 'Glory to Ukraine' in English. 'Win', 'victory' and 'enemy' are also more frequently used in Ukrainian tweets.

Russian tweets more often use the negative words 'bomb', 'operation', 'killed' and 'attacked' than Ukrainian tweets do. They also more often use positive words like 'russia', 'russian', 'like' and 'special', where, as previously stated, 'russia' and 'russian' are slightly more ambiguous in terms of what the sentiment score should be.

Overall, Ukrainian tweets have a higher sentiment score than Russian tweets, although the shift in sentiment score is not large.

First Two Weeks vs Last Two Weeks Wordshift

Word Shift

Analysis of Time Wordshifts

To clarify what can be seen in the plot: the light colors represent words that were less frequently used in the last two weeks than in the first two weeks, and the dark colors represent words that are more frequently used in the last two weeks than in the first two weeks.

Yellow indicates positive words, while blue indicates negative words.

As can be seen from the light blue color, the words 'crimes', 'rape' and 'raped' were more frequently used in the first two weeks. However, the words 'stop', 'refugees', 'nuclear' and 'hospital' are more frequently used now. This might be because there is a larger number of refugees now than in the first two weeks of the war, and because the topic of nuclear war and nuclear explosions has been more prevalent ever since the attack near Chernobyl.

The word 'plant' is also more frequently used now, but without more context it is hard to determine whether this refers to crops or to nuclear power plants.

There is a slightly higher overall sentiment in the last two weeks of data than in the first two weeks. However, this difference is very slight - only a 0.01 shift in sentiment score.

Word clouds using TF-IDF scores

We constructed several word clouds filtered on time and language, respectively, in order to display and highlight relevant words in the debate on Twitter. The word clouds were generated based on TF-IDF scores in order to highlight not simply the most frequently used words, but the most relevant words for the specific language or time period.

The term frequency (TF) is computed using the following formula: $$ \large \mathrm{TF}=\frac{\text { count of } \mathrm{t} \text { in } \mathrm{d}}{\text { total number of words in } \mathrm{d}} $$ where $t$ is the term and $d$ denotes the document.

The IDF formula is as follows: $$ \large \mathrm{IDF}=\log \left(\frac{\mathrm{N}}{\mathrm{df}+1}\right) $$ where N is the total number of documents, df is the document frequency (number of documents in which term t occurs) and 1 is the shrink term.

Usually it is smart to add a shrink term to the denominator (+1 for example) to avoid division by zero.

However, in our case we are certain that every term is present in at least one of the documents, and therefore we can omit the shrink term (which in turn ensures no negative IDF values). Our final IDF formula used in the assignment is therefore the following:

$$ \large \mathrm{IDF}=\log \left(\frac{\mathrm{N}}{\mathrm{df}}\right) $$

Three word clouds were constructed filtered on language: English, Ukrainian and Russian. In the same way, three word clouds were constructed filtered on time, covering the periods 21st Feb 2022 - 10th March 2022, 11th March 2022 - 29th March 2022 and 30th March 2022 - 15th April 2022.

In the sections below, the code used for wordcloud generation can be seen. It should be noted that a random subsample of 50.000 data points was used to generate the wordclouds, as it was infeasible in terms of computation time to use all 1 million data points. This of course means that the corpus only represents a 5% subset of the original, which will most likely affect the IDF scores especially. However, we assume that this is still a representative cross section of the data for the purpose of this specific part of the analysis.
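As a sketch of that code, the snippet below follows the TF-IDF formulas above and feeds the scores to the wordcloud package. The toy documents, and the choice of treating each language or time period as one concatenated document, are our assumptions for illustration.

```python
# TF-IDF per the formulas above (no shrink term), fed to a wordcloud.
import math
from collections import Counter
from wordcloud import WordCloud

# toy corpus: one token list per "document" (e.g. per language)
documents = [
    ["glory", "ukraine", "stand"],          # e.g. Ukrainian-language tokens
    ["special", "operation", "sanctions"],  # e.g. Russian-language tokens
    ["russia", "ukraine", "nato", "gas"],   # e.g. English-language tokens
]

def tfidf_scores(documents, target_doc):
    """TF from the target document, IDF = log(N / df) over all documents."""
    N = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    tf = Counter(target_doc)
    total = len(target_doc)
    return {w: (c / total) * math.log(N / df[w]) for w, c in tf.items()}

scores = tfidf_scores(documents, documents[0])
WordCloud(width=800, height=400).generate_from_frequencies(scores).to_file("cloud.png")
```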

Wordclouds filtered on language: English, Ukrainian and Russian

Wordcloud images: English, Ukrainian and Russian

Analysis of language wordclouds

English

Looking at the English wordcloud, we can get a sense of which words/subjects are discussed in general all over the world, as English is the language most often used to communicate between different nationalities. Some of the most prominent words are "russia(n)", "ukraine(ian)", "zelensky", "putin", "biden" and "trump", which are all very general in terms of the war. In other words, the presence of these words could indicate that the political situation and the factual development of the war are what is discussed the most across nationalities. This seems sensible, as this is the way in which most countries are affected by the war and therefore probably what is interesting to most people worldwide. Other words representing more specific subjects include "gas", "oligarchs", "nato", "oil", "fight", "support" and "kyiv". These most likely represent the consequences of the western sanctions towards Russia, which have caused a shortage of gas and oil supply as well as price increases. It is also highly debated to what extent NATO should help Ukraine, and the presence of this word could indicate that people are expressing their opinion on this matter online. Lastly, words like "please", "fight" and "kyiv" are included in the wordcloud. We cannot conclude anything regarding who is fighting for/supporting whom based on these stand-alone words. However, the presence of the word "kyiv" indicates that a majority of the users expressing themselves in English support Ukraine, as spelling Kyiv in the Ukrainian way has become a symbol of supporting the country during the war.

Russian

Looking at the Russian wordcloud, many of the words from the English wordcloud, such as "ukraine(ian)", "russia(n)", "zelensky" and "putin", recur here. However, some distinct words are present as well, including "terrorist", "operation", "oppression", "belarus", "access" and "sanctions". Once again it is hard to say anything certain about the discourse and opinions towards these subjects based on single words, but we can infer that subjects such as the development of the invasion of Ukraine and the effects of the sanctions towards their country are among the focuses of the Russians.

Ukranian

One of the most distinctive and informative wordclouds in terms of sentiment and discourse is the Ukrainian one. Here we see (probably) hashtags such as "stoprussia" and "standwithukraine", which of course express sympathy with the country. Words such as "children" and "feminism" may refer to the war crimes committed by Russia, such as the killing of civilians (including the targeting of schools and kindergartens) and the rape of women, and words like "freedom", "please" and "world" may symbolise the Ukrainians' cry and hope for help from the rest of the world. It should be emphasised that these are just suggestions, and we cannot conclude anything confidently about the meanings behind the wordclouds based entirely on single words.

Wordclouds filtered on time period: 21/2-10/3, 11/3-29/3 and 30/3-15/4

Wordcloud images: 21/2-10/3, 11/3-29/3 and 30/3-15/4

Analysis of time period wordclouds

Looking at the wordclouds of relevant words for the different time periods, we hoped to see a change of discourse and/or subject as time proceeded. In the first period of the war, spanning from the Russian invasion of Ukraine until March 10th, some of the most prominent words are "williamson", "edinburgh" and "absolutist". The first refers to Hugh Williamson, the Europe and Central Asia director at Human Rights Watch, who in the first days of the war expressed deep concern about Russia's war crimes (www.hrw.org), and "edinburgh" may be rated as relevant because Ukrainians joined huge demonstrations in Scotland against the attacks on their homeland during the first days of the war (www.bbc.com). "Absolutist" can refer to different specific cases or simply the meaning of the term itself; one specific case trending on March 7th was Elon Musk's refusal to block Russian news sources, which resulted in many people accusing him of being a "free speech absolutist" (www.economictimes.indiatimes.com).

Turning our attention to the second wordcloud, from March 11th to March 29th, some of the largest displayed words are "birzamanlar", "ibrahimcelikkol", "zulhak", "russianassettracker" (probably a hashtag) and "schwarzenegger". When investigating these words, we found that some seemed more related to the war than others. For example, "birzamanlar" and "ibrahimcelikkol" are a Turkish expression and a Turkish actor, respectively, with no obvious connection to the war. On the other hand, the "russian asset tracker" is the name of a global investigation into assets held outside Russia linked to individuals sanctioned for supporting the government of Russia (covered by www.theguardian.com in April 2022). Furthermore, Arnold Schwarzenegger gave a speech on March 18th in which he publicly asked Putin to stop the war (www.bbc.com).

In the last wordcloud, representing the period from March 30th until April 15th, it is very obvious that the focus of many tweets regarding the war is the rapid increase in war crimes committed by Russia. Words like "bucha-massacre", "rape" and "acid" are in focus, which most likely refers to the killings and rapes in Bucha during the first weeks of April (www.kyivindependent.com) as well as the public speculation about whether Russia has used chemical weapons in Ukraine (www.19fortyfive.com).

All in all, the wordclouds based on TF-IDF scores give insights into the historical and political developments during the war; more specifically, the TF-IDF scores seem to capture specific stories and trends rather than the overall most frequently used words.

Dynamic Bar Chart of Trending Hashtags

The bar chart is originally an animation displayed on our webpage, but it was only possible to show a still image here (due to the implementation in HTML). We therefore refer to our webpage for the live animation.

Analyses of dynamic bar chart

Looking at the bar chart from day to day, we see a generally persistent trend in hashtags such as Ukraine and Russia. Furthermore, many hashtags expressing sympathy with Ukraine are trending throughout the period, for example "StandWithUkraine" and "StopRussia".

One of the most interesting things we find when studying the development in hashtag trends is that different Ukrainian city names trend around the same day as they were attacked by Russia. This means that the bar chart provides us with a rather precise timeline of the attacks on Ukrainian cities. On 24/02/2022 we see that "Kyiv" is trending, which corresponds to the day Russia initiated attacks on the Kyiv Oblast (www.edition.cnn.com). The same applies on 02/03/2022, when "Kharkiv" was trending and the Russian army attacked Kharkiv (www.nytimes.com), on 10/03/2022, when "Mariupol" was trending and the city was bombed (www.theguardian.com), and lastly for "Bucha" and "Buchamassacre" on 04/04/2022 and the following days. Bucha was visited on 04/04 by Ukrainian president Zelensky after the killings, rapes and torture of many Ukrainian civilians.

All in all, it seems that the majority, if not all, of the top trending hashtags on Twitter regarding the war express sympathy with Ukraine. It is therefore probable that the majority of Twitter users sympathise with Ukraine. Still, it cannot be excluded that some nations may use other social media platforms than Twitter, and therefore we cannot conclude that the majority of social media users in general support Ukraine.

Twitter Networks

In the social network part of this assignment, we investigate Twitter network graphs based on the language of users and parent authors in Ukrainian and Russian. We first examine to what degree users cluster together and how the network clustering compares to a null model. Then we look further into these clusters and investigate how communities in the graphs are formed and how strong they are - for this we use modularity as a measure, as well as comparing the graphs with configuration models.

Due to computational limitations, an in-depth analysis of the English Twitter network is not possible: if we were to use the whole dataset to create the English network, we would have around 500.000 nodes, which is infeasible to work with. For the Ukrainian and Russian networks, all data points are used - 3.562 and 10.339 data points respectively for the two groups. We tried sampling the English tweets (40.000 tweets), but decided that, rather than drawing conclusions from an analysis of a subset instead of the whole dataset, we would focus our attention on the two other language-based networks.

Creating the graphs

Unlike the GME graph, we will not build a reciprocal undirected graph from the weighted edgelists. This is because only tweets containing the keywords specified earlier were collected, which prevents us from retrieving all replies to a single tweet; furthermore, the Twitter API does not allow us to fetch all replies, since this requires a Premium account. This means that connections will not appear in both directions. We should have considered this as a limitation from the beginning, but since a big part of Twitter is retweeting rather than actually replying (which would never result in connections in both directions), building non-reciprocal graphs may still be a good representation of the network.
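A sketch of the construction with networkx is shown below (the column names and language codes are assumptions): each tweet with a known parent author contributes an undirected edge between the two users.

```python
# Sketch of the language-filtered graph construction; one node per user,
# one edge per (author, parent author) interaction.
import networkx as nx
import pandas as pd

df = pd.read_csv("tweets_clean.csv")

def build_language_graph(df, lang):
    sub = df[(df["lang"] == lang) & df["parent_author"].notna()]
    G = nx.Graph()
    G.add_edges_from(zip(sub["author"], sub["parent_author"]))
    return G

G_uk = build_language_graph(df, "uk")
G_ru = build_language_graph(df, "ru")
```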

This gives us the following two undirected networks:

Ukrainian Twitter Network:

Russian Twitter Network:

Random Networks as Null Models

Now we will create Random Networks as null models to investigate some properties of the Twitter Networks.

We will create random networks based on the G(N, p) model introduced by Edgar Nelson Gilbert: each pair of the N labeled nodes is connected with probability p. N and p are computed from the network under investigation as follows:

The formula for the expected number of edges in the random network is (given in section 3.3 of the Network Science book): $$ \langle L\rangle=\sum_{L=0}^{\frac{N(N-1)}{2}} L p_{L}=p \frac{N(N-1)}{2} $$

Setting $\langle L\rangle$ equal to the actual number of edges in the given network, the probability $p$ can be found as: $$ p = \frac{\langle L\rangle}{\frac{N(N-1)}{2}} $$
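In code, with networkx (building on the graphs sketched above):

```python
# G(N, p) null model: p is chosen so the expected edge count matches the
# observed number of edges L, per the derivation above.
import networkx as nx

def random_counterpart(G, seed=42):
    N, L = G.number_of_nodes(), G.number_of_edges()
    p = L / (N * (N - 1) / 2)
    return nx.gnp_random_graph(N, p, seed=seed)

R_uk = random_counterpart(G_uk)  # graphs from the construction sketch above
R_ru = random_counterpart(G_ru)
```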

This gives us the following Random Networks:

Random Ukrainian Twitter Network:

Random Russian Twitter Network:

Using the random networks, we can investigate the clustering trends within the Ukrainian and Russian Twitter networks. Computing the local clustering coefficient of a particular node captures the relationship between the neighbors of that node and relates them to the density of the neighborhood - the more densely connected the neighborhood surrounding the node, the higher the coefficient. The local clustering coefficient is given in section 2.10 of the Network Science book and is defined as:

$$ C_{i}=\frac{2 L_{i}}{k_{i}\left(k_{i}-1\right)} $$

where $L_i$ is the number of links between the $k_i$ neighbors of node $i$.

We can then use this formula to compute the average clustering coefficient of the whole network: $$ \langle C\rangle=\frac{1}{N} \sum_{i=1}^{N} C_{i} $$
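networkx implements both coefficients directly, so the comparison reduces to one call per network (graphs from the sketches above):

```python
# Average clustering coefficient <C> for each network and its random counterpart.
import networkx as nx

for name, graph in [("Ukrainian", G_uk), ("Random Ukrainian", R_uk),
                    ("Russian", G_ru), ("Random Russian", R_ru)]:
    print(f"{name}: <C> = {nx.average_clustering(graph):.4f}")
```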

Cluster coefficient for the Ukrainian Twitter Networks:

Cluster coefficient for the Russian Twitter Networks:

Visualizing the Twitter networks and their random counterpart

Ukrainian: Original vs. Random Graph

Russian: Original vs. Random Graph

What do these findings tell us?

Comparing the Ukrainian and Russian clustering coefficients, we see that the Russian coefficient is a little over twice as large as the Ukrainian one. In the two networks, the distribution of original tweets, replies and retweets is about 15%, 6% and 79% for Ukrainian tweets and 18%, 11% and 71% for Russian tweets, which could explain why the Russian clustering coefficient is higher: in the Russian network, people seem to reply to each other more often rather than retweeting. However, we must of course be aware that Russian is also spoken in Ukraine (the language distribution is about 68% Ukrainian and 30% Russian according to Wikipedia).

In both cases, the clustering coefficient of the random network is actually higher than that of the original graph, which suggests that users replying to a tweet do not usually continue talking with each other in the replies section. This is especially true when retweeting is such a big part of Twitter: if many people retweet the same tweet, they all share the same parent author, but a cluster will not form around them since they do not reply to each other. We can relate this to how the original networks compare to their random counterparts visually - in both networks the number of edges is roughly the same in the original and random graphs, but the number of visible (non-isolated) nodes is higher in the random graph. This could explain why the original networks look sparser than their random counterparts.

This clustering trend makes sense if we consider how users generally interact on Twitter. The structure we typically see in Twitter networks is that a certain tweet goes viral and many users retweet, reply to and quote-tweet it - then these replies and quote tweets are themselves retweeted and replied to, and so on, almost in an exponential manner. This means that people do not form one big discussion forum in the comments of the original tweet, but rather spread the message in a very fast and "broad" way. Furthermore, the connections we observe in these networks are not the same connections we see in a social network: the links do not correspond to friendships but rather to a reaction a user has to something another user stated. Parent authors will not be inclined to respond in the same way as on Reddit, for instance - if the Twitter account of a big television network posts a tweet about current news regarding the war, it will most likely not respond to users in its comment section. This means we will not typically observe triadic closures in these Twitter networks, compared to a social network of friends, family and co-workers.

Comparing how people interact on Twitter with Reddit, we know that there is much more of a sense of community on Reddit - when people interact in subreddits about a specific topic and answer each other back and forth, they are much more likely to become friends or at least acquaintances. On Twitter, however, seeing as we investigate a difficult and widespread topic, there is not the same sense of community.

Investigating Communities in the Twitter Networks

Even though we do not see the same strong sense of community on Twitter as on Reddit, we can still investigate interesting properties of the unique community structures (if such structures exist).

Modularity

We can investigate the communities in the Twitter networks by computing modularity values, which measure how good a certain partition is at detecting communities. The higher these values are, the stronger the detected communities: the observed number of links between nodes within a community is then higher than the expected number of links, and we can infer that a certain cluster structure must be present.

The modularity is a number in the interval $[-1,1]$: a value close to 1 indicates a strong partitioning, assigning the whole network to a single community gives a modularity of 0, and negative modularity values occur, for instance, when each node is assigned to its own individual community.

Computing the modularity of several partitions allows us to compare them with each other, and optimising this modularity value helps us detect even stronger communities.

Based on equation 9.12 in section 9.4 of the Network Science book, the formula for modularity is: $$ M=\sum_{c=1}^{n_{c}}\left[\frac{L_{c}}{L}-\left(\frac{k_{c}}{2 L}\right)^{2}\right] $$

where $n_c$ is the number of communities in the partition, $L_c$ is the number of links inside community $c$, $k_c$ is the total degree of the nodes in community $c$ and $L$ is the total number of links in the graph. We formulate this in the following function:
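Our original function is not reproduced here; a minimal sketch of Eq. 9.12 could look like this ('partition' maps each node to a community label):

```python
# Sketch of the modularity computation (Eq. 9.12 of the Network Science book).
def modularity(G, partition):
    L = G.number_of_edges()
    M = 0.0
    for c in set(partition.values()):
        nodes = [n for n in G if partition[n] == c]
        L_c = G.subgraph(nodes).number_of_edges()   # links inside community c
        k_c = sum(d for _, d in G.degree(nodes))    # total degree of community c
        M += L_c / L - (k_c / (2 * L)) ** 2
    return M
```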

Community Partitions

To find communities in the Twitter networks, we use the python-louvain implementation of the Louvain algorithm, and we use the resulting partitions to color the nodes of the networks accordingly.
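With the python-louvain package (imported as community), the partitioning is a single call; the fixed seed is our assumption for reproducibility.

```python
# Louvain community detection with python-louvain.
import community as community_louvain

partition_uk = community_louvain.best_partition(G_uk, random_state=42)
print("Modularity:", modularity(G_uk, partition_uk))  # modularity() from above
```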

Modularity value for Ukrainian Twitter Graph: 0.92265

Modularity value for Russian Twitter Graph: 0.88728

Visualizing the communities in the Ukrainian network:

Visualizing the communities in the Russian network:

Statistical Evaluation of the Modularities

Before we can conclude anything regarding the found communities based on their modularity values, we must assess whether the computed modularity is statistically different from $0$. To do this we implement a configuration model, which creates a new network such that the degree of each node is the same as in the original network, but with different links. The algorithm works in the following way:

This is implemented in the following function:
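Our exact implementation is omitted; the sketch below uses networkx's double_edge_swap as one standard degree-preserving rewiring and re-scores the original partition on each rewired network (re-scoring the fixed partition is what centres the null modularities near zero, matching the plots described below). The swap counts are illustrative.

```python
# Degree-preserving null model via repeated double edge swaps: every node
# keeps its degree while the links are rewired.
import networkx as nx
import numpy as np

def null_modularities(G, partition, n_iter=1000, seed=42):
    """Modularity of the given partition on degree-preserving rewirings of G."""
    mods = []
    for i in range(n_iter):
        R = G.copy()
        nx.double_edge_swap(R, nswap=2 * R.number_of_edges(),
                            max_tries=100 * R.number_of_edges(), seed=seed + i)
        mods.append(modularity(R, partition))  # modularity() from above
    return mods

mods_uk = null_modularities(G_uk, partition_uk)
print("95th percentile:", np.percentile(mods_uk, 95))
```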

Using this algorithm we create $1000$ randomised versions of each network, and for each of them we compute the modularity of the partition.

We can plot these modularities for each network:

The 95th percentiles for the two tests are:

Community stats

Ukraine:

Russia:

In both cases the modularities of the configuration models are centered around zero, which corresponds to assigning the whole network to a single community - meaning no communities are present. Visualising the modularities and computing their 95th percentiles allows us to conclude clearly that the modularity values of 0.923 and 0.887 for the Ukrainian and Russian networks, respectively, are significantly different from 0. This means that the partitions found by the Louvain algorithm are better than assigning no communities at all or assigning each node to its own community.

Investigating some of these communities, we see for instance that the following three users have the highest in-degrees in the Ukrainian network:

Authors with the highest in-degrees in the Russian network:

Looking at the visualisations, coloring the nodes based on the community partitions allows us to detect some cluster structures. Some of the bigger clusters result from the neighborhoods around the parent authors mentioned above and the users reacting to their tweets. As perhaps expected, most of the clusters come from users retweeting and replying to bigger news stations or organisations in both networks. We do not typically see private individuals engage in discussions of the same size as those around the news stations, but as discussed earlier, this is likely because the nature of tweeting and replying differs from the trend we see on Reddit.

Interactive Ukrainian and Russian Twitter Networks

For visualisation purposes, we also created interactive versions of the Ukrainian and Russian networks, where the reader is able to explore the communities and read what users are tweeting about. Due to computational limitations, these networks are subsets of the original two Twitter networks.

Ukrainian Interactive Network

Russian Interactive Network

We create the interactive graphs using the Pyvis library. Note that parent authors will not have a tweet attached to them, since they can have multiple tweets.
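A sketch of the export with Pyvis is shown below; the hover-text lookup tweets_by_user is a hypothetical helper, not part of the Pyvis API.

```python
# Interactive network export with Pyvis: convert the networkx subgraph and
# attach each user's tweet text as a hover title.
from pyvis.network import Network

net = Network(height="600px", width="100%")
net.from_nx(G_uk)  # a subsampled networkx graph, as described above
for node in net.nodes:
    # tweets_by_user: hypothetical dict mapping user -> tweet text
    node["title"] = tweets_by_user.get(node["id"], "")
net.save_graph("ukrainian_network.html")
```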

The interactive networks can be found on our website!

Discussion. What worked and what didn't. And why.

Reduction and incompleteness of the dataset

One of the most important limitations to address in this project is the size and "incompleteness" of our dataset. The original dataset consisted of more than 3 million data points, which is an extensively large amount of data; however, it will still lack tweets addressing the Russia-Ukraine war that do not contain the specified keywords from the data collection query. Moreover, we reduced the dataset further when we subsampled 18.000 tweets per day due to computational limits, yielding an even bigger loss of tweets discussing the subject. We intended to use our final dataset of 1 million tweets for all analysis purposes, but in practice this became infeasible due to time and computational limits, and we therefore ended up using an even smaller part of the dataset for some parts of our analysis. This reduction of the dataset is important to keep in mind when drawing any conclusions from the data, as it can only serve as a representative of the true trends. However, we argue that the data partitions used for all parts of the analysis are still a representative cross section of the complete debate on Twitter and its associated trends, as we subsampled our data uniformly across all dates and randomly within each day.

Lack of location data

Our initial plan was to use the (geo-)location data associated with each tweet to separate the networks and parts of the text analyses by nationality. However, during the project we realised that most Twitter users do not allow tracking of geolocation, and the location data provided by the users themselves was filled with bugs and unusable information. As a workaround we therefore used the written language of the tweet instead. This can still give some indication of trends within nationalities, although the analyses were now limited to tweets written by people expressing themselves in their native language. Moreover, English is used by the majority of users across all nationalities, which makes the distribution of tweets per language significantly skewed. Still, we found it interesting to investigate the trends within Ukrainian and Russian as well as the general public using English, and we concluded that it was still somewhat representative of the corresponding nationalities because of the large amount of data.

Skewed distribution of tweets

Looking at the tweet volume volatility for the languages English, Russian and Ukrainian, we found significantly more English tweets than tweets in the other languages. This is important to be aware of when concluding anything within the time domain, as the English tweets will dominate the tendencies significantly. In other words, any conclusions about overall tendencies over time will be much more representative of the English-speaking "community" on Twitter than of any other. However, as this also indicates that the majority of Twitter users communicate in English, we can argue that the time analysis and its conclusions are still representative of the majority of Twitter users, i.e. of the overall public opinions and trends. This bias is removed in the language analysis, as we filter the tweets on language and only consider tweets in one written language at a time.

No significant results in dispersion plot

Our initial intention with the dispersion plot was to show some kind of development in the usage of certain words over time. However, as we had to specify the words displayed in the plot manually, the process became somewhat trial-and-error based. We expected that using the most frequent/general keywords in the debate would not give any significant results, as these words are used continuously over time. Therefore we experimented with more specific words representing certain events, such as "nuclear war" and "chernobyl", and expected to see higher usage of these words around the periods when the events took place. Still, this was not the case in practice, and the dispersion plot was therefore rather uninformative in the end.

Keywords introduce bias

One rather important bias in our analysis and dataset is the use of keywords in the data collection. As the tweets were queried based on these specific keywords, the volume of tweets containing them will also be unnaturally high. This is an important flaw in our analysis when investigating any kind of trends based on frequency, as many of the keywords will logically appear as very popular. Examples of analyses where this bias is significant are the bar chart of most popular hashtags and the word shift plots. A workaround could have been to remove all keywords from our dataset, but this might have resulted in other important tendencies getting lost instead.

Building the Twitter networks

As mentioned when building the Twitter networks, one limitation of our dataset is that the fetched tweets must contain the keywords "russia, ukraine, putin, zelensky" etc. - this means we are not able to follow an entire Twitter thread if the replies to the original tweet do not contain any of these keywords. For our networks this limitation means we will almost never see triadic closures between a parent author and the users who reply to their tweets. Even if we wanted to fetch all replies, as we did with the "search_comments" function from the Reddit API, we would need access to a Premium account on the Twitter API. However, we argue that even without all replies, the nature of how Twitter users behave is still represented in our networks: users do not often gather in the same thread and stay long enough to reply to each other, but rather scatter throughout the conversation, almost in an exponential manner as mentioned earlier.

Furthermore, one could expand these networks so that they are based not on the interaction between users and parent authors of a particular language, but rather on how users gather in groups discussing certain topics that arise in this difficult subject matter. This would require a form of NLP-based sentiment analysis on top of the network analysis, which could prove difficult to carry out, but would also be very interesting to investigate.