TC-BERT: large-scale language model for Korean technology commercialization documents

Taero Kim, Changdae Oh, Hyeji Hwang, Eunkyeong Lee, Yewon Kim, Yunjeong Choi, Sungjin Kim, Hosik Choi, Kyungwoo Song

Research output: Contribution to journal › Article › peer-review

Abstract

Pre-trained language models (LMs) have shown remarkable success across diverse tasks and domains. An LM trained on documents from a specific area (e.g., biomedicine, education, or finance) provides expert-level knowledge of that domain, and many efforts have been made to develop such domain-specific LMs. Despite its potential benefits, however, an LM for the technology commercialization (TC) domain has not yet been investigated. In this study, we build a TC-specialized large LM pre-trained on a Korean TC corpus. First, we collect a large-scale dataset containing 199,857,586 general Korean sentences and 17,562,751 TC-related Korean sentences. Second, based on this dataset, we pre-train a Transformer-based language model, resulting in TC-BERT. Third, we validate TC-BERT on three practical applications: document classification, keyword extraction, and document recommendation. To this end, we devise a new keyword extraction algorithm and propose a document recommender algorithm based on TC-BERT's document embeddings. Through various quantitative and qualitative experiments, we comprehensively verify TC-BERT's effectiveness and its applications.
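The abstract does not detail the recommender algorithm itself; as a rough illustration of the general idea it names (recommendation via similarity between TC-BERT document embeddings), the Python sketch below shows one common way such a system is built with the Hugging Face transformers library. The checkpoint name "tc-bert", the mean-pooling strategy, and the cosine-similarity ranking are all assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' exact method) of embedding-based document
# recommendation: embed documents with a BERT encoder, then rank by cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "tc-bert"  # hypothetical placeholder; any Korean BERT checkpoint could be used

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    """Mean-pool token embeddings into one L2-normalized vector per document."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)        # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)

def recommend(query_doc, corpus_docs, top_k=5):
    """Return indices of the top_k corpus documents most similar to query_doc."""
    q = embed([query_doc])                 # (1, H)
    c = embed(corpus_docs)                 # (N, H)
    scores = (q @ c.T).squeeze(0)          # cosine similarity (vectors are normalized)
    return scores.topk(min(top_k, len(corpus_docs))).indices.tolist()
```

In practice, corpus embeddings would be precomputed and stored in an index rather than recomputed per query; the sketch keeps everything in memory only for brevity.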

Original language: English
Article number: 163
Journal: Journal of Supercomputing
Volume: 81
Issue number: 1
DOIs
State: Published - Jan 2025

Keywords

  • BERT
  • Keyword extraction
  • Language model
  • Natural language processing
  • Technology commercialization

