TY - JOUR
T1 - TC-BERT
T2 - large-scale language model for Korean technology commercialization documents
AU - Kim, Taero
AU - Oh, Changdae
AU - Hwang, Hyeji
AU - Lee, Eunkyeong
AU - Kim, Yewon
AU - Choi, Yunjeong
AU - Kim, Sungjin
AU - Choi, Hosik
AU - Song, Kyungwoo
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
PY - 2025/1
Y1 - 2025/1
AB - Pre-trained language models (LMs) have shown remarkable success across diverse tasks and domains. An LM trained on documents from a specific area (e.g., biomedicine, education, or finance) provides expert-level knowledge of that domain, and there have been many efforts to develop such domain-specific LMs. Despite its potential benefits, however, developing an LM for the technology commercialization (TC) domain has not been investigated. In this study, we build a TC-specialized large LM pre-trained on a Korean TC corpus. First, we collect a large-scale dataset containing 199,857,586 general Korean sentences and 17,562,751 TC-related Korean sentences. Second, based on this dataset, we pre-train a Transformer-based language model, resulting in TC-BERT. Third, we validate TC-BERT on three practical applications: document classification, keyword extraction, and a recommender system. To this end, we devise a new keyword extraction algorithm and propose a document recommender algorithm based on TC-BERT’s document embeddings. Through various quantitative and qualitative experiments, we comprehensively verify TC-BERT’s effectiveness and its applications.
KW - BERT
KW - Keyword extraction
KW - Language model
KW - Natural language processing
KW - Technology commercialization
UR - http://www.scopus.com/inward/record.url?scp=85210321597&partnerID=8YFLogxK
U2 - 10.1007/s11227-024-06597-6
DO - 10.1007/s11227-024-06597-6
M3 - Article
AN - SCOPUS:85210321597
SN - 0920-8542
VL - 81
JO - Journal of Supercomputing
JF - Journal of Supercomputing
IS - 1
M1 - 163
ER -