This repository focuses on a detailed comparison between human (CDI) and model performance at the lexical level, including:

- Receptive Vocabulary, using a spot-the-word task (adapted from BabySLM)
- Expressive Vocabulary, using (un)prompted generations
To get started with this module you will need:

- a compatible version of Python (3.10+)
- the Enchant library
- phonemizer
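If you are missing the system-level prerequisites, a minimal sketch for a Debian/Ubuntu machine is shown below; the package names are assumptions and may differ on other platforms.

```bash
# Assumed Debian/Ubuntu package names: libenchant backs pyenchant,
# and espeak-ng is the usual backend for phonemizer.
sudo apt-get install libenchant-2-dev espeak-ng

# Python-side packages.
pip install pyenchant phonemizer
```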
You can install the module using the following commands:

```bash
git clone https://github.com/Jing-L97/Lexical-benchmark.git
cd Lexical-benchmark
pip install .
```

or, for an editable installation (useful during development):

```bash
git clone https://github.com/Jing-L97/Lexical-benchmark.git
cd Lexical-benchmark
pip install -e .
```
You can also install directly from the git repository (without cloning) using:
```bash
pip install git+https://github.com/Jing-L97/Lexical-benchmark.git
```
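If you need a specific branch, tag, or commit, pip's VCS URL syntax accepts a revision after `@`; the `main` below is only a placeholder for whatever revision you want.

```bash
# Replace "main" with the branch, tag, or commit hash to install.
pip install git+https://github.com/Jing-L97/Lexical-benchmark.git@main
```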
adjust-count: convert word counts into accumulated monthly counts.
```
❯ adjust-count --help
usage: adjust-count [-h] [--gen_file GEN_FILE] [--est_file EST_FILE] [--CDI_path CDI_PATH]
                    [--freq_path FREQ_PATH] [--prompt_type PROMPT_TYPE] [--lang LANG]
                    [--set SET] [--header_lst HEADER_LST] [--count COUNT]

options:
  -h, --help            show this help message and exit
  --gen_file GEN_FILE
  --est_file EST_FILE
  --CDI_path CDI_PATH
  --freq_path FREQ_PATH
  --prompt_type PROMPT_TYPE
  --lang LANG
  --set SET
  --header_lst HEADER_LST
  --count COUNT
```
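A hypothetical invocation is sketched below; every path and value is a placeholder chosen for illustration, not a file shipped with the repository.

```bash
# Placeholder paths and values, shown only to illustrate the flags listed above.
adjust-count \
    --gen_file data/generation.csv \
    --est_file data/estimation.csv \
    --CDI_path data/CDI/ \
    --freq_path data/freq/ \
    --lang EN \
    --set machine
```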
get-frequencies:
```
❯ get-frequencies --help
usage: get-frequencies [-h] [--src_file SRC_FILE] [--target_file TARGET_FILE]
                       [--header HEADER] [--ngram NGRAM]

options:
  -h, --help            show this help message and exit
  --src_file SRC_FILE
  --target_file TARGET_FILE
  --header HEADER
  --ngram NGRAM
```
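A hypothetical invocation (the file names and the column name passed to `--header` are placeholders):

```bash
# Presumably counts n-gram frequencies (here unigrams) from a source file
# into a target file; paths and the column name are placeholders.
get-frequencies \
    --src_file data/train_corpus.csv \
    --target_file data/train_freq.csv \
    --header content \
    --ngram 1
```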
match-frequencies:
```
❯ match-frequencies --help
usage: match-frequencies [-h] [--CDI_path CDI_PATH] [--human_freq HUMAN_FREQ]
                         [--machine_freq MACHINE_FREQ] [--lang LANG]
                         [--test_type TEST_TYPE] [--sampling_ratio SAMPLING_RATIO]
                         [--nbins NBINS]

options:
  -h, --help            show this help message and exit
  --CDI_path CDI_PATH
  --human_freq HUMAN_FREQ
  --machine_freq MACHINE_FREQ
  --lang LANG
  --test_type TEST_TYPE
  --sampling_ratio SAMPLING_RATIO
  --nbins NBINS
```
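A hypothetical invocation; all paths and values below are placeholders for illustration.

```bash
# Placeholder paths and values; adjust to your own data and experiment setup.
match-frequencies \
    --CDI_path data/CDI/ \
    --human_freq data/human_freq.csv \
    --machine_freq data/machine_freq.csv \
    --lang EN \
    --test_type CDI \
    --sampling_ratio 1 \
    --nbins 6
```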
dataset-explore:
```
❯ dataset-explore --help
usage: dataset-explore [-h] [--header HEADER] [--model MODEL] [--prompt PROMPT]

options:
  -h, --help            show this help message and exit
  --header HEADER
  --model MODEL
  --prompt PROMPT
```
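For example (all values are placeholders, since the accepted values are not documented here):

```bash
# Placeholder values for illustration only.
dataset-explore --header content --model 400h --prompt unprompted
```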
cf-analysis: ... TBA
create-machine-dataset:
```
❯ create-machine-dataset --help
usage: create-machine-dataset [-h] [-m MODE] [-f FILE] train_freq_dir out_dir input_filename_path

positional arguments:
  train_freq_dir
  out_dir
  input_filename_path

options:
  -h, --help            show this help message and exit
  -m MODE, --mode MODE
  -f FILE, --file FILE
```
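A hypothetical invocation with the three positional arguments; directory and file names are placeholders.

```bash
# Placeholder directories and file name; adjust to your own data layout.
create-machine-dataset data/train_freq/ data/machine_dataset/ data/filenames.csv
```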
merge-generations: ... TBA
morphology: ... TBA
phonemize-data: ... TBA
train-model: ... TBA
You'll probably want to start here:
- How to select the test set
- How to run the Accumulator model
- How to evaluate the receptive vocabulary
- How to evaluate the expressive vocabulary
src_data: raw data grabbed from the internet
processed_data: locally generated CSV data, split into human and machine subsets