Modelling shifting trends over time via topic analysis of text documents

  • Oliver Krauss
  • Andrea Aschauer
  • Andreas Stöckl
  • Advanced Information Systems and Technology, University of Applied Sciences Upper Austria, Softwarepark 13, Hagenberg, 4232, Austria
  • Playful Interactive Environments, University of Applied Sciences Upper Austria, Softwarepark 13, Hagenberg, 4232, Austria
  • Digital Media Department, University of Applied Sciences Upper Austria, Softwarepark 11, Hagenberg, 4232, Austria
Cite as
Krauss O., Aschauer A., and Stöckl A. (2022). Modelling shifting trends over time via topic analysis of text documents. Proceedings of the 21st International Conference on Modelling and Applied Simulation (MAS 2022), 009. DOI: https://doi.org/10.46354/i3m.2022.mas.009

Abstract

Generating new business concepts is an important part of founding start-ups and of innovating within existing companies. We identify rising and falling trends in different domains via topic modelling of text data published over time. Topic modelling is an important tool for classifying and clustering documents. Building on the top2vec method, which uses joint embeddings of documents and words to both find and describe clusters, we have implemented an incremental variant that traces the growth and decline of these clusters over time. Identifying such trends across large text collections provides decision support for innovators by highlighting rising trends and declining opportunities in different domains. The method was tested and evaluated on arXiv articles. Visualizations of the clusters and their descriptions give users an interface for identifying the trends. In the future, this method can serve as a foundation for decision support systems that generate innovation ideas based on upcoming research trends.
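
The idea of tracing the growth and decline of topic clusters over time can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each document has already been assigned a topic label (for example by a top2vec-style model) and a publication period, and it simply fits a least-squares slope to each topic's share of documents per period. The function names, the `tolerance` threshold, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def topic_shares(assignments):
    """Return ({topic: {period: share}}, sorted periods) for (period, topic) pairs."""
    docs_per_period = Counter(period for period, _ in assignments)
    docs_per_pair = Counter(assignments)
    shares = defaultdict(dict)
    for (period, topic), count in docs_per_pair.items():
        # Share of this period's documents that belong to the topic.
        shares[topic][period] = count / docs_per_period[period]
    return shares, sorted(docs_per_period)


def slope(values):
    """Least-squares slope of values against the indices 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def classify_trends(assignments, tolerance=0.01):
    """Label each topic 'rising', 'falling', or 'stable' (tolerance is an assumed cutoff)."""
    shares, periods = topic_shares(assignments)
    labels = {}
    for topic, by_period in shares.items():
        # Periods in which the topic has no documents count as a share of zero.
        series = [by_period.get(p, 0.0) for p in periods]
        s = slope(series)
        if s > tolerance:
            labels[topic] = "rising"
        elif s < -tolerance:
            labels[topic] = "falling"
        else:
            labels[topic] = "stable"
    return labels


# Toy (year, topic) assignments; topic "a" shrinks while topic "b" grows.
docs = [
    (2019, "a"), (2019, "a"), (2019, "a"), (2019, "b"),
    (2020, "a"), (2020, "a"), (2020, "b"), (2020, "b"),
    (2021, "a"), (2021, "b"), (2021, "b"), (2021, "b"),
]
trends = classify_trends(docs)
print(trends)
```

Using the share of documents per period rather than raw counts keeps a topic from appearing to rise merely because the overall publication volume grew; whether the paper normalises in this way is not stated in the abstract.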
