Social media and topic modeling: how to analyze posts in practice

There is a substantial amount of data generated on the internet every second — posts, comments, photos, and videos. These different data types mean that there is a lot of ground to cover, so let’s focus on one — text.

Social conversations are built largely on written words: tweets, Facebook posts, comments, online reviews, and so on. Whether you are a social media marketer, a Facebook group or profile moderator, or someone promoting a business on social media, you need to know how your audience reacts to the content you upload. One way is to read everything, mark hateful comments, divide the rest into groups of similar topics, calculate statistics and… lose a big chunk of your time, only to find thousands of new comments waiting to be added to your calculations. Fortunately, there is another solution to this problem: machine learning. From this text you will learn:

  • Why do you need specialised tools for social media analyses?
  • What can you get from topic modeling, and how is it done?
  • How can you automatically look for hate speech in comments?

Why are social media texts unique?

Before jumping to the analyses, it is really important to understand why social media texts are so unique:

  • Posts and comments are short. They mostly contain one simple sentence, or even a single word or expression. This limits the amount of information we can obtain from a single post.


  • Emojis and smiley faces — used almost exclusively on social media. They give additional details about the author’s emotions and context.


  • Slang phrases, which make posts resemble spoken rather than written language. They make statements appear more casual.


These features make social media a distinct source of information that demands special attention when running a machine learning analysis. Most open-source machine learning models are trained on long, formal texts such as Wikipedia articles and other web pages. As a result, they perform badly on social media data, because they do not understand the additional forms of expression it contains. This problem is called domain shift and is typical in NLP. Different data also requires customised data preparation, called preprocessing. This step consists of cleaning the text of uninformative tokens such as URLs or mentions and converting it to a machine-readable format (more about how we do it in Sotrender). This is why it is crucial to use tools created specifically for your data source to get the best results.
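To make the preprocessing step more concrete, here is a minimal sketch in Python. It is not Sotrender's actual pipeline: the regular expressions, the `preprocess` name and the decision to keep emojis as separate tokens are all assumptions made for illustration.

```python
import re

# Assumed patterns for tokens we treat as uninformative
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# A token is a run of word characters, or any single other symbol
# (punctuation, or a simple one-codepoint emoji)
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def preprocess(text):
    """Drop URLs and @mentions, lowercase the rest and split it into tokens."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return TOKEN_RE.findall(text.lower())


tokens = preprocess("@anna check this out 😍 https://example.com/post")
print(tokens)
```

Keeping the emoji as a token matters here, because it carries part of the emotional context that a model trained on formal text would never see.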

Topic Modeling for social media

Machine learning for text analysis (Natural Language Processing) is a vast field with many model types that can extract insight from your data. One area that answers the question “what are the topics of given pieces of text?” is topic modeling. These models help you understand what people are talking about in general. Topic modeling does not require a specially prepared data set with predefined topics. It finds topics, which are patterns hidden within the data, on its own without supervision, which makes it an unsupervised machine learning method. This means that it is relatively easy to build a separate model for each individual problem.

There are lots of different algorithms that can be used for this task, but the most common and widely used is LDA (Latent Dirichlet Allocation). It is based on word frequencies and the distribution of topics in texts. To put it simply, this method counts words in a given data set and groups them into topics based on their co-occurrence. Then the percentage distribution of topics in each document is calculated. In other words, the method assumes that each text is a mixture of topics, which works great with long documents where every paragraph relates to a different matter.


Figure 1. LDA algorithm (Credit: Columbia University)

Social media texts are usually too short to mix several topics, so they need a different procedure. One of the newer algorithms is GSDMM (Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model) [1]. What makes this one so different?

  1. It is fast.
  2. It is designed for short texts.
  3. It is easily explained with an analogy of a teacher (the algorithm) who wants to divide students (texts) into groups (topics) of similar interests.


Figure 2. Group assignment algorithm

Students are told to write down some movie titles they liked within 2 minutes. Most students are able to list 3–5 movies within this time frame (this corresponds to the limited number of words in social media texts). Then they are randomly assigned to a group. In the last step, every student chooses a table again, with two rules in mind:

  • pick a group with more students — favours bigger groups
  • or a group with the most similar movie titles — makes groups more cohesive.

This last step is repeated multiple times. The first rule, which favours bigger groups, is crucial to ensure that groups are not excessively fragmented. Due to the limited number of movie titles (words) for each student (text), each group (topic) is bound to have members with different movies in their lists but from the same genre.

As a result of the GSDMM algorithm, you obtain an assignment of each text to one topic, as well as a list of the most important words for every topic.


Figure 3. Assigning documents to topics and extracting topic words
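The group assignment procedure can be sketched in a few dozen lines of plain Python. This is a simplified illustration of the sampling rule from the GSDMM paper [1], not a production implementation: the function name `gsdmm`, the default values of `alpha` and `beta`, and the toy documents are assumptions chosen for this example.

```python
import random
from collections import Counter


def gsdmm(docs, n_topics=8, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Cluster tokenised short texts; returns (labels, per-topic word counts)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_count = [0] * n_topics                         # texts in each topic
    word_total = [0] * n_topics                        # words in each topic
    word_count = [Counter() for _ in range(n_topics)]  # per-topic word counts

    def move(doc, z, sign):
        """Add (sign=+1) or remove (sign=-1) a text from topic z."""
        doc_count[z] += sign
        word_total[z] += sign * len(doc)
        for w in doc:
            word_count[z][w] += sign
            if word_count[z][w] == 0:
                del word_count[z][w]

    # Students are first assigned to random tables
    labels = [rng.randrange(n_topics) for _ in docs]
    for doc, z in zip(docs, labels):
        move(doc, z, +1)

    for _ in range(n_iters):
        for i, doc in enumerate(docs):
            move(doc, labels[i], -1)          # the student leaves the table
            weights = []
            for z in range(n_topics):
                weight = doc_count[z] + alpha  # rule 1: prefer bigger groups
                seen = Counter()
                for j, w in enumerate(doc):    # rule 2: prefer similar words
                    weight *= (word_count[z][w] + beta + seen[w]) / (
                        word_total[z] + vocab_size * beta + j
                    )
                    seen[w] += 1
                weights.append(weight)
            # Sample the new table in proportion to the weights
            labels[i] = rng.choices(range(n_topics), weights=weights)[0]
            move(doc, labels[i], +1)
    return labels, word_count


# Hypothetical, already tokenised posts
docs = [
    ["president", "government", "covid"],
    ["government", "covid", "lockdown"],
    ["president", "government", "health"],
    ["flight", "travel", "border"],
    ["travel", "border", "closed"],
    ["flight", "travel", "cancelled"],
]
labels, word_count = gsdmm(docs, n_topics=4)
for z, counts in enumerate(word_count):
    print(z, counts.most_common(3))
```

`n_topics` is only an upper bound: some topics typically end up empty, and the number of non-empty ones is what the algorithm settles on. Multiplying probabilities directly is fine for short texts, but longer documents would need the computation in log space to avoid underflow.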

The tricky part is deciding on the number of topics (a common problem with unsupervised methods), but once you do, you can gain quite a lot of insight from the data:

  • Distribution of topics in your data


Figure 4. Topic distribution in data

  • Word clouds, which allow us to comprehend a topic and name it. They are a quick and easy solution that can replace reading the whole set of texts and spare you hours of tedious work dividing it into sets.


Figure 5. Three examples of word clouds

Looking from left to right, the first word cloud contains the words president, government, disease and covid, so we can assume the main theme is politics. There are also less prominent words like cough, sick and health, so it is a topic about government actions regarding health issues.

  • Time series analysis of topics. As the plot below shows, some topics gain attention over time, like number 7, while others fade away, like number 4. To grasp what is popular now or may become popular in the future, it helps to look back and see how topics changed in the past.


Figure 6. Distribution of topics over time
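Once every text has a topic label and a publication date, a distribution over time like this one comes down to counting. The sketch below groups posts by ISO week and computes each topic's share. The `posts` data and the `weekly_topic_share` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical (publication date, assigned topic id) pairs
posts = [
    (date(2020, 3, 2), 4),
    (date(2020, 3, 3), 4),
    (date(2020, 3, 4), 7),
    (date(2020, 3, 10), 7),
    (date(2020, 3, 11), 7),
    (date(2020, 3, 12), 4),
]


def weekly_topic_share(posts):
    """Map (ISO year, ISO week) to {topic id: share of that week's posts}."""
    counts = defaultdict(Counter)
    for day, topic in posts:
        iso_year, iso_week, _ = day.isocalendar()
        counts[(iso_year, iso_week)][topic] += 1
    return {
        week: {topic: n / sum(c.values()) for topic, n in c.items()}
        for week, c in sorted(counts.items())
    }


shares = weekly_topic_share(posts)
for week, share in shares.items():
    print(week, share)
```

In this toy data, topic 4 dominates the first week and topic 7 the second, which is the kind of shift the plot makes visible at a glance.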

Use case

In one of our recent projects for Collegium Civitas we analyzed 50 000 social media posts and comments and performed topic analysis on them. It allowed our client to answer questions like:

1) What was discussed on social media over a span of 2 months?

In the dataset we were able to distinguish 10 different topics revolving around Covid-19. Discussions covered statistics and Covid-19 etiology, everyday life, government response to the pandemic, consequences of travel restrictions, trade market and supplies, health care during the pandemic, church and politics, common knowledge and conspiracy theories about Covid-19, politics and economy, and spam messages and ads.

2) How were the discussions influenced by the pandemic situation?

During the outbreak of the pandemic, the biggest theme was the origin and statistics of Covid-19. People talked about how the situation was changing and exchanged information about how the disease spreads. To read more, visit Collegium Civitas’ site (Polish version only).

Hate speech recognition

Another question that can be answered with machine learning is “what kind of emotion do people express in their comments or posts?” or “is my content generating hateful comments?”. There are only a few solutions for these tasks in the Polish language. That is why we built models for sentiment and hate speech recognition based on social media text at Sotrender. Our solutions were built in two steps.

The first step is to convert text and emojis into numerical vector representations (embeddings) to be used later in neural networks. The main goal of this step is to obtain a language model (LM) that captures knowledge of human language, so that vectors representing words with similar meanings are close to each other (for example: queen and king, or paragraph and article). This property, called semantic similarity, is shown in the figure below.


Figure 7. The intuition behind word similarity
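A common way to measure how close two word vectors are is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration. Real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, not taken from any trained model
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "paragraph": [0.2, 0.1, 0.9],
    "article": [0.1, 0.2, 0.9],
}

royal = cosine_similarity(embeddings["king"], embeddings["queen"])
text = cosine_similarity(embeddings["paragraph"], embeddings["article"])
mixed = cosine_similarity(embeddings["king"], embeddings["article"])
print(round(royal, 3), round(text, 3), round(mixed, 3))
```

With these toy vectors, the related pairs score close to 1, while the unrelated pair scores much lower.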

Training this model is similar to teaching a child how to speak by talking to them. By listening to their parents talk, children are able to grasp the meaning of words, and the more they hear, the more they understand.

Following this analogy, we have to use a huge set of social media texts to train our model to understand their language. That is why we used a set of 100 million posts and comments to train our model, so that it can properly assign vectors to words as well as to emojis. Tokens vectorised with the embedding model provide the input to the neural network.

The second step is designing a neural network for a specific task: hate speech recognition. The most important thing is the data set, because the model needs examples of both hateful and non-hateful texts to learn how to tell them apart. To get the best results, you need to experiment with different architectures and model hyperparameters.

As a result of the hate speech recognition model, we get another grouping of our data set. Now we can see how our audience reacts and how many hateful comments or posts it creates. What’s more, by combining this with the publication time of each comment, we can see whether there was a specific time period when the most hateful comments were generated, as shown in the histogram below.


Figure 8. Hate Speech distribution over time

Combining this distribution with recent posts or events can give you insight into the type of content that provokes people. Changes in the share of hate speech over time can also be related to changes in topic distribution. Combining all the information from the analysis can provide an in-depth picture of the dataset.


Figure 9. Weekly text count with hate speech

As the histogram above shows, most hate is connected to topics 3, 6 and 7. Knowing what makes people angry gives you the opportunity to avoid sensitive topics in the future.
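Linking hate speech to topics is again a matter of combining two labels per comment. Here is a minimal sketch, assuming each comment already has a topic id from the topic model and a boolean flag from the hate speech model. The data and the `hate_share_by_topic` helper are hypothetical.

```python
from collections import Counter

# Hypothetical (topic id, flagged as hateful) pairs, one per comment
comments = [
    (3, True), (3, True), (3, False),
    (6, True), (6, False),
    (1, False), (1, False),
]


def hate_share_by_topic(comments):
    """Return {topic id: fraction of that topic's comments flagged as hateful}."""
    totals, hateful = Counter(), Counter()
    for topic, is_hateful in comments:
        totals[topic] += 1
        if is_hateful:
            hateful[topic] += 1
    return {topic: hateful[topic] / totals[topic] for topic in sorted(totals)}


share = hate_share_by_topic(comments)
print(share)
```

Looking at shares rather than raw counts keeps large topics from appearing more hateful simply because they contain more comments.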

The same goes for sentiment analysis. We can produce similar visualizations for positive, negative or neutral comments and see their distribution over time or across topics. If you would like to read the whole report built on our analysis of the 8 weeks of data, you can find it here (only Polish version).

Conclusion

At Sotrender we have models for hate speech and sentiment recognition that are constantly improved and updated for social media texts. What’s more, we have experience in building topic models for individual cases. As you can see, there are a lot of benefits to this type of analysis:

  • Getting to know your audience
  • Taking an in-depth look at the topics of comments
  • Discovering trending themes
  • Finding the source of hatred or negativity around your content

To name just a few!

References

[1] Yin, Jianhua and Wang, Jianyong, A Dirichlet multinomial mixture model-based approach for short text clustering, (2014), KDD ’14.

Original post: https://towardsdatascience.com/social-media-and-topic-modeling-how-to-analyze-posts-in-practice-d84fc0c613cb
