TY - JOUR
T1 - QASA: Advanced Question Answering on Scientific Articles
T2 - 40th International Conference on Machine Learning, ICML 2023
AU - Lee, Yoonjoo
AU - Lee, Kyungjae
AU - Park, Sunghyun
AU - Hwang, Dasol
AU - Kim, Jaehyeon
AU - Lee, Hong In
AU - Lee, Moontae
N1 - Publisher Copyright: © 2023 Proceedings of Machine Learning Research. All rights reserved.
PY - 2023
Y1 - 2023
AB - Reasoning is the crux of intellectual thinking. While question answering (QA) tasks are prolific with various computational models and benchmark datasets, they mostly tackle factoid or shallow QA without requiring deeper understanding. Dual process theory asserts that human reasoning consists of associative thinking to collect relevant pieces of knowledge and logical reasoning to consciously reach conclusions grounded on evidential rationale. Based on our intensive think-aloud study, which revealed three types of questions (surface, testing, and deep questions), we first propose the QASA benchmark, which consists of 1798 novel question answering pairs that require full-stack reasoning on scientific articles in AI and ML fields. Then we propose the QASA approach, which tackles full-stack reasoning with large language models via associative selection, evidential rationale-generation, and systematic composition. Our experimental results show that QASA's full-stack inference outperforms the state-of-the-art INSTRUCTGPT by a large margin. We also find that rationale-generation is critical for the performance gain, suggesting how we should rethink advanced question answering. The dataset is available at https://github.com/lgresearch/QASA.
UR - http://www.scopus.com/inward/record.url?scp=85174408165&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85174408165
SN - 2640-3498
VL - 202
SP - 19010
EP - 19035
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
Y2 - 23 July 2023 through 29 July 2023
ER -