This repository focuses on a detailed comparison between human (CDI) and model performance at the lexical level, including:

- Receptive Vocabulary, using a spot-the-word task (adapted from BabySLM)
- Expressive Vocabulary, using (un)prompted generations
To get started with this module you will need:

- a compatible version of Python (3.10+)
- the Enchant library
- phonemizer
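If you are missing the system-level prerequisites, a minimal sketch for a Debian/Ubuntu machine is shown below; the package names are assumptions and may differ on other platforms.

```bash
# Assumed Debian/Ubuntu package names: libenchant backs pyenchant,
# and espeak-ng is the usual backend for phonemizer.
sudo apt-get install libenchant-2-dev espeak-ng

# Python-side packages.
pip install pyenchant phonemizer
```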
You can install the module using the following commands:

```bash
git clone https://github.com/Jing-L97/Lexical-benchmark.git
cd Lexical-benchmark
pip install .
```

or, for an editable installation (useful during development):

```bash
git clone https://github.com/Jing-L97/Lexical-benchmark.git
cd Lexical-benchmark
pip install -e .
```
You can also install directly from the git repository (without cloning) using:
```bash
pip install git+https://github.com/Jing-L97/Lexical-benchmark.git
```
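If you need a specific branch, tag, or commit, pip's VCS URL syntax accepts a revision after `@`; the `main` below is only a placeholder for whatever revision you want.

```bash
# Replace "main" with the branch, tag, or commit hash to install.
pip install git+https://github.com/Jing-L97/Lexical-benchmark.git@main
```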
adjust-count: convert word counts into accumulated monthly counts.
```
❯ adjust-count --help
usage: adjust-count [-h] [--gen_file GEN_FILE] [--est_file EST_FILE] [--CDI_path CDI_PATH]
                    [--freq_path FREQ_PATH] [--prompt_type PROMPT_TYPE] [--lang LANG]
                    [--set SET] [--header_lst HEADER_LST] [--count COUNT]

options:
  -h, --help            show this help message and exit
  --gen_file GEN_FILE
  --est_file EST_FILE
  --CDI_path CDI_PATH
  --freq_path FREQ_PATH
  --prompt_type PROMPT_TYPE
  --lang LANG
  --set SET
  --header_lst HEADER_LST
  --count COUNT
```
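A hypothetical invocation is sketched below; every path and value is a placeholder chosen for illustration, not a file shipped with the repository.

```bash
# Placeholder paths and values, shown only to illustrate the flags listed above.
adjust-count \
    --gen_file data/generation.csv \
    --est_file data/estimation.csv \
    --CDI_path data/CDI/ \
    --freq_path data/freq/ \
    --lang EN \
    --set machine
```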
get-frequencies:
```
❯ get-frequencies --help
usage: get-frequencies [-h] [--src_file SRC_FILE] [--target_file TARGET_FILE]
                       [--header HEADER] [--ngram NGRAM]

options:
  -h, --help            show this help message and exit
  --src_file SRC_FILE
  --target_file TARGET_FILE
  --header HEADER
  --ngram NGRAM
```
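A hypothetical invocation (the file names and the column name passed to `--header` are placeholders):

```bash
# Presumably counts n-gram frequencies (here unigrams) from a source file
# into a target file; paths and the column name are placeholders.
get-frequencies \
    --src_file data/train_corpus.csv \
    --target_file data/train_freq.csv \
    --header content \
    --ngram 1
```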
match-frequencies:
```
❯ match-frequencies --help
usage: match-frequencies [-h] [--CDI_path CDI_PATH] [--human_freq HUMAN_FREQ]
                         [--machine_freq MACHINE_FREQ] [--lang LANG]
                         [--test_type TEST_TYPE] [--sampling_ratio SAMPLING_RATIO]
                         [--nbins NBINS]

options:
  -h, --help            show this help message and exit
  --CDI_path CDI_PATH
  --human_freq HUMAN_FREQ
  --machine_freq MACHINE_FREQ
  --lang LANG
  --test_type TEST_TYPE
  --sampling_ratio SAMPLING_RATIO
  --nbins NBINS
```
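A hypothetical invocation; all paths and values below are placeholders for illustration.

```bash
# Placeholder paths and values; adjust to your own data and experiment setup.
match-frequencies \
    --CDI_path data/CDI/ \
    --human_freq data/human_freq.csv \
    --machine_freq data/machine_freq.csv \
    --lang EN \
    --test_type CDI \
    --sampling_ratio 1 \
    --nbins 6
```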
dataset-explore:
```
❯ dataset-explore --help
usage: dataset-explore [-h] [--header HEADER] [--model MODEL] [--prompt PROMPT]

options:
  -h, --help            show this help message and exit
  --header HEADER
  --model MODEL
  --prompt PROMPT
```
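For example (all values are placeholders, since the accepted values are not documented here):

```bash
# Placeholder values for illustration only.
dataset-explore --header content --model 400h --prompt unprompted
```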
cf-analysis: ... TBA
create-machine-dataset:
```
❯ create-machine-dataset --help
usage: create-machine-dataset [-h] [-m MODE] [-f FILE] train_freq_dir out_dir input_filename_path

positional arguments:
  train_freq_dir
  out_dir
  input_filename_path

options:
  -h, --help            show this help message and exit
  -m MODE, --mode MODE
  -f FILE, --file FILE
```
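A hypothetical invocation with the three positional arguments; directory and file names are placeholders.

```bash
# Placeholder directories and file name; adjust to your own data layout.
create-machine-dataset data/train_freq/ data/machine_dataset/ data/filenames.csv
```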
merge-generations: ... TBA
morphology: ... TBA
phonemize-data: ... TBA
train-model: ... TBA
You'll probably want to start here:
- How to select the test set
- How to run the Accumulator model
- How to evaluate the receptive vocabulary
- How to evaluate the expressive vocabulary
src_data: raw data grabbed from the internet
processed_data: locally generated CSV data, split into human and machine subsets