Jing-L97/Lexical-benchmark

Machine CDI: Lexical-benchmark for language acquisition

This repository provides a detailed comparison between human (CDI) and model performance at the lexical level, covering:

  1. Receptive Vocabulary using a spot-the-word task (adapted from BabySLM)

  2. Expressive Vocabulary using (un)prompted generations
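As an illustration of the receptive task, a spot-the-word test is commonly scored by checking whether the model assigns a higher score to a real word than to a matched nonword. The sketch below is not this repository's implementation: the function name `spot_the_word_accuracy`, the toy scores, and the word pairs are all made up for the example.

```python
from typing import Callable, Iterable, Tuple


def spot_the_word_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (word, nonword) pairs where the real word scores higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (word, nonword) pair")
    correct = sum(1 for word, nonword in pairs if score(word) > score(nonword))
    return correct / len(pairs)


# Hypothetical scorer: a lookup table of made-up log-probabilities.
toy_logprob = {"dog": -2.0, "dag": -7.5, "milk": -3.1, "mulk": -2.9}
pairs = [("dog", "dag"), ("milk", "mulk")]

accuracy = spot_the_word_accuracy(pairs, lambda w: toy_logprob[w])
print(accuracy)  # 0.5: "dog" beats "dag", but "milk" loses to "mulk"
```

In practice the `score` callable would wrap a language model's log-probability for the item; any such wrapper is outside the scope of this sketch.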

Getting started

Installation

To get started with this module, you will need:

  • a compatible version of Python (3.10 or later)
  • the enchant library
  • phonemizer
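A quick way to check these prerequisites before installing is sketched below. It assumes the Python bindings are importable as `enchant` (the import name used by PyEnchant) and `phonemizer`; the helper `check_prerequisites` is hypothetical and not part of this package. It only checks that the modules can be found, not that the underlying Enchant system library works.

```python
import importlib.util
import sys


def check_prerequisites(min_python=(3, 10), modules=("enchant", "phonemizer")):
    """Return a dict mapping each requirement name to True (found) or False."""
    status = {"python": sys.version_info >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        status[name] = importlib.util.find_spec(name) is not None
    return status


for requirement, ok in check_prerequisites().items():
    print(f"{requirement}: {'ok' if ok else 'MISSING'}")
```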

You can install the module using the following commands:

git clone https://github.com/Jing-L97/Lexical-benchmark.git
cd Lexical-benchmark
pip install .

or, for an editable installation (useful during development):

git clone https://github.com/Jing-L97/Lexical-benchmark.git
cd Lexical-benchmark
pip install -e .

You can also install directly from the git repository (without cloning) using:

pip install git+https://github.com/Jing-L97/Lexical-benchmark.git

Available Commands

adjust-count : Converts word counts into accumulated (cumulative) monthly counts.

❯ adjust-count --help
usage: adjust-count [-h] [--gen_file GEN_FILE] [--est_file EST_FILE] [--CDI_path CDI_PATH]
                    [--freq_path FREQ_PATH] [--prompt_type PROMPT_TYPE] [--lang LANG]
                    [--set SET] [--header_lst HEADER_LST] [--count COUNT]

options:
  -h, --help            show this help message and exit
  --gen_file GEN_FILE
  --est_file EST_FILE
  --CDI_path CDI_PATH
  --freq_path FREQ_PATH
  --prompt_type PROMPT_TYPE
  --lang LANG
  --set SET
  --header_lst HEADER_LST
  --count COUNT
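To make "accumulated monthly count" concrete, the sketch below turns per-month counts into running totals. It only illustrates the idea and is not the code behind `adjust-count`; the function name `accumulate_monthly` and the numbers are invented for the example.

```python
from itertools import accumulate


def accumulate_monthly(counts_by_month):
    """Turn {month: count} into {month: running total up to that month}."""
    months = sorted(counts_by_month)
    totals = accumulate(counts_by_month[m] for m in months)
    return dict(zip(months, totals))


# Made-up counts of one word, keyed by age in months.
monthly = {16: 3, 17: 0, 18: 5, 19: 2}
cumulative = accumulate_monthly(monthly)
print(cumulative)  # {16: 3, 17: 3, 18: 8, 19: 10}
```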

get-frequencies :

❯ get-frequencies --help
usage: get-frequencies [-h] [--src_file SRC_FILE] [--target_file TARGET_FILE]
                       [--header HEADER] [--ngram NGRAM]

options:
  -h, --help            show this help message and exit
  --src_file SRC_FILE
  --target_file TARGET_FILE
  --header HEADER
  --ngram NGRAM
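The command above has no description yet, so the following is only a guess at the kind of computation its name and `--ngram` option suggest: counting how often each n-gram occurs in a list of utterances. The function `ngram_frequencies`, the whitespace tokenization, and the lowercasing are assumptions made for this sketch, not the tool's documented behavior.

```python
from collections import Counter


def ngram_frequencies(utterances, n=1):
    """Count word n-grams within each utterance (no n-gram spans two utterances)."""
    counts = Counter()
    for utterance in utterances:
        tokens = utterance.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


corpus = ["the dog runs", "the dog sleeps"]
unigrams = ngram_frequencies(corpus, n=1)
bigrams = ngram_frequencies(corpus, n=2)
print(unigrams["the"], bigrams["the dog"])  # 2 2
```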

match-frequencies :

❯ match-frequencies --help
usage: match-frequencies [-h] [--CDI_path CDI_PATH] [--human_freq HUMAN_FREQ]
                         [--machine_freq MACHINE_FREQ] [--lang LANG]
                         [--test_type TEST_TYPE] [--sampling_ratio SAMPLING_RATIO]
                         [--nbins NBINS]

options:
  -h, --help            show this help message and exit
  --CDI_path CDI_PATH
  --human_freq HUMAN_FREQ
  --machine_freq MACHINE_FREQ
  --lang LANG
  --test_type TEST_TYPE
  --sampling_ratio SAMPLING_RATIO
  --nbins NBINS

dataset-explore :

❯ dataset-explore --help
usage: dataset-explore [-h] [--header HEADER] [--model MODEL] [--prompt PROMPT]

options:
  -h, --help       show this help message and exit
  --header HEADER
  --model MODEL
  --prompt PROMPT

cf-analysis : ... TBA

create-machine-dataset :

❯ create-machine-dataset --help
usage: create-machine-dataset [-h] [-m MODE] [-f FILE] train_freq_dir out_dir input_filename_path

positional arguments:
  train_freq_dir
  out_dir
  input_filename_path

options:
  -h, --help            show this help message and exit
  -m MODE, --mode MODE
  -f FILE, --file FILE

merge-generations : ... TBA

morphology : ... TBA

phonemize-data : ... TBA

train-model : ... TBA

Brief description

You'll probably want to start from there:

Result visualization

Folder structure

src_data: source data downloaded from the internet

processed data: csv -human -machine loc generated data

About

This project provides a more fine-grained human-model comparison at the lexical level.
