Table of Contents¶

  • 1. Required Libraries
  • 2. Initial Data Analysis
    • 2.1. Dataset Overview and Summary
  • 3. Multi-use functions for Word2Vec
  • 4. Word2Vec Pretrained Vector
    • 4.1. Observations
  • 5. LLM (DistilBERT) Pretrained Model
    • 5.1 Observations
  • 6. Model Comparison: Word2Vec vs. DistilBERT

Required Libraries¶

In [ ]:
!pip install --upgrade --force-reinstall --no-cache-dir numpy pandas gensim datasets evaluate
Successfully installed aiohappyeyeballs-2.6.1 aiohttp-3.12.13 aiosignal-1.3.2 attrs-25.3.0 certifi-2025.6.15 charset_normalizer-3.4.2 datasets-3.6.0 dill-0.3.8 evaluate-0.4.4 filelock-3.18.0 frozenlist-1.7.0 fsspec-2025.3.0 gensim-4.3.3 hf-xet-1.1.5 huggingface-hub-0.33.0 idna-3.10 multidict-6.5.0 multiprocess-0.70.16 numpy-1.26.4 packaging-25.0 pandas-2.3.0 propcache-0.3.2 pyarrow-20.0.0 python-dateutil-2.9.0.post0 pytz-2025.2 pyyaml-6.0.2 requests-2.32.4 scipy-1.13.1 six-1.17.0 smart-open-7.1.0 tqdm-4.67.1 typing-extensions-4.14.0 tzdata-2025.2 urllib3-2.5.0 wrapt-1.17.2 xxhash-3.5.0 yarl-1.20.1
In [ ]:
import pandas as pd
from datasets import load_dataset
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import gensim.downloader as api
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences, to_categorical
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense
from tensorflow.keras import Sequential
from tensorflow.keras.metrics import AUC, SparseCategoricalAccuracy
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, DataCollatorWithPadding, create_optimizer, pipeline
from datasets import Dataset
import evaluate
from transformers.keras_callbacks import KerasMetricCallback
import gc
from scipy.special import softmax

Initial Data Analysis¶

In [ ]:
nltk.download("stopwords")
nltk.download("punkt_tab")
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
Out[ ]:
True
In [ ]:
dataset = load_dataset("SetFit/sst5")
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Repo card metadata block was not found. Setting CardData to empty.
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
Generating train split:   0%|          | 0/8544 [00:00<?, ? examples/s]
Generating validation split:   0%|          | 0/1101 [00:00<?, ? examples/s]
Generating test split:   0%|          | 0/2210 [00:00<?, ? examples/s]
In [ ]:
train = pd.DataFrame(dataset["train"])
test = pd.DataFrame(dataset["test"])
val = pd.DataFrame(dataset["validation"])
In [ ]:
def initial_analysis(data):

  print(f"************1st five in the dataset************ \n {data.head()}")
  print(f"************Summary Stat************ \n {data.describe()}")
  print(f"************Count of missing Values************\n {data.isnull().sum()}")
  print(f"************Dataset shape************\n {data.shape}")
  print(f"************Duplicated rows count************ \n {data.duplicated().sum()}")
  print(f'************Unique label count************ \n {round(data["label_text"].value_counts(normalize=True)*100, 2)}')
In [ ]:
initial_analysis(train)
************1st five in the dataset************ 
                                                 text  label     label_text
0  a stirring , funny and finally transporting re...      4  very positive
1  apparently reassembled from the cutting-room f...      1       negative
2  they presume their audience wo n't sit still f...      1       negative
3  the entire movie is filled with deja vu moments .      2        neutral
4  this is a visually stunning rumination on love...      3       positive
************Summary Stat************ 
              label
count  8544.000000
mean      2.058052
std       1.281570
min       0.000000
25%       1.000000
50%       2.000000
75%       3.000000
max       4.000000
************Count of missing Values************
 text          0
label         0
label_text    0
dtype: int64
************Dataset shape************
 (8544, 3)
************Duplicated rows count************ 
 10
************Unique label count************ 
 label_text
positive         27.18
negative         25.96
neutral          19.01
very positive    15.07
very negative    12.78
Name: proportion, dtype: float64
In [ ]:
def nlp_initial_analysis(data):

  stop_words = set(stopwords.words("english"))
  stop_words.update(["movie", "film", "rrb", "lrb"])

  tokens = [r.lower() for text in data["text"] for r in word_tokenize(text) if r.lower() not in stop_words and r.isalnum()]
  freqdist = FreqDist(tokens)
  top10_words = freqdist.most_common(10)
  word, count = zip(*top10_words)
  print(f"Total number of unique words is {len(freqdist)}, and the total number of words is {sum(freqdist.values())}")

  plt.figure(figsize=(12,8))
  plt.bar(word, count)
  plt.title("Top 10 most frequent words")
  plt.show()


  reviews_label = list(data["label_text"].value_counts().index)
  for i in reviews_label:

    review_type = data[data["label_text"] == i]
    plt.figure(figsize=(12,8))

    text = " ".join(reviews.lower() for reviews in review_type["text"])
    wordcloud = WordCloud(stopwords=stop_words).generate(text)
    plt.title(f"WordCloud for {i}")
    plt.imshow(wordcloud)
    plt.show()



  plt.pie(data["label_text"].value_counts(normalize=True), labels=data["label_text"].value_counts(normalize=True).index, autopct="%1.1f%%")
  plt.title("Classification percentage")
  plt.show()
In [ ]:
nlp_initial_analysis(train)
Total number of unique words is 14703, and the total number of words is 74949

📝 Dataset Overview and Summary¶

The summary statistics show labels spanning the full 0–4 range with a mean near 2.06, indicating that the dataset covers all five sentiment classes with a relatively balanced spread.


Count of Missing Values¶

No missing values were found in the dataset:

Column       Missing Count
text         0
label        0
label_text   0

This confirms the dataset is clean and ready for further processing.


Dataset Shape¶

  • The dataset contains 8544 rows and 3 columns, indicating a substantial sample size for modeling and analysis.

Duplicated Rows Count¶

  • A total of 10 duplicated rows were found in the dataset.
  • These duplicates can be removed to improve data quality before training.
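
Dropping these duplicates is a one-line pandas operation; a minimal sketch with toy rows (not the actual SST-5 data):

```python
import pandas as pd

# Toy frame containing one exact duplicate row
df = pd.DataFrame({
    "text": ["great film", "great film", "dull plot"],
    "label": [4, 4, 0],
})

# drop_duplicates keeps the first occurrence of each repeated row
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(df), "->", len(deduped))  # 3 -> 2
```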

Unique Label Count¶

The distribution of sentiment classes based on the label_text column:

Label Text       Proportion (%)
Positive         27.18
Negative         25.96
Neutral          19.01
Very Positive    15.07
Very Negative    12.78

From this, we observe:

  • The dataset is relatively balanced, with a slight skew towards positive and negative sentiments.
  • All five sentiment categories are well-represented, which supports multi-class classification tasks.

📌 Conclusion:
The dataset is clean, fairly balanced, and ready for exploratory data analysis (EDA), text preprocessing, and sentiment modeling tasks.

WordCloud Sentiment Analysis¶

🔹 Neutral Sentiment¶

  • Dominant Words: one, like, story, time, character, make, little
  • Observations:
    • The language tends to be descriptive and observational.
    • Common words such as "story", "character", and "plot" suggest a focus on narrative structure without strong emotional weight.
    • Words like "good", "enough", and "feel" indicate slight leanings toward judgment, but not definitively polarized.

🔻 Negative Sentiment¶

  • Dominant Words: like, character, bad, even, story, work, little, hard, feel
  • Observations:
    • The word "bad" stands out as an explicitly negative term.
    • Words such as "hard", "little", "nothing", and "problem" suggest dissatisfaction with aspects like story or character development.
    • While some neutral words like "character" and "story" appear, they are framed in more critical contexts.

🔺 Positive Sentiment¶

  • Dominant Words: like, one, work, make, good, story, character, performance, love
  • Observations:
    • There’s a clear presence of positively connoted words such as "good", "love", "performance", and "great".
    • Mentions of "funny", "family", and "heart" suggest emotional and engaging content.
    • Words such as "work" and "make" reflect appreciation for effort and execution.

🔻 Very Negative Sentiment¶

  • Dominant Words: bad, even, dull, minute, character, story, plot, worst, nothing, seem
  • Observations:
    • Strongly negative terms like "bad", "dull", and "worst" dominate the cloud, showing clear dissatisfaction.
    • Frequent use of "minute", "hour", and "time" may suggest boredom or a sense of wasted time.
    • Critical references to "character", "story", and "plot" imply dissatisfaction with the film's core elements.
    • The presence of "even", "seem", and "could" indicates unmet expectations or disappointment.

🔺 Very Positive Sentiment¶

  • Dominant Words: performance, best, funny, work, make, love, story, character, year, comedy
  • Observations:
    • Words like "performance", "best", and "love" reflect strong admiration and praise.
    • Mentions of "funny", "comedy", and "entertaining" suggest a joyful, engaging experience.
    • Frequent appearances of "character", "story", and "director" highlight appreciation of narrative and craftsmanship.
    • Use of "year", "great", and "well" shows the film stood out as a high point among others.

📌 Summary¶

Across all sentiments, common thematic words include "story", "character", and "like", which reflects their central importance in reviews regardless of polarity. The sentiment-specific modifiers (e.g., bad, love, hard, funny) help distinguish the emotional direction of each review.
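
The sentiment-specific modifiers behind these clouds can also be surfaced numerically; a minimal sketch using `collections.Counter` over toy reviews (not the SST-5 corpus):

```python
from collections import Counter

# Toy tokenized reviews grouped by sentiment label (illustrative only)
reviews = {
    "positive": ["a funny and warm story", "great performance , great story"],
    "negative": ["a dull and bad story", "bad plot , nothing works"],
}
stop = {"a", "and", ","}

for label, texts in reviews.items():
    # Count non-stopword tokens for this sentiment class
    counts = Counter(w for t in texts for w in t.split() if w not in stop)
    print(label, counts.most_common(3))
```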

Multi-use functions for Word2Vec¶

In [ ]:
google_news_model = api.load("word2vec-google-news-300")
[==================================================] 100.0% 1662.8/1662.8MB downloaded
In [ ]:
def remove_stopwords(data):

  stop_words = set(stopwords.words("english"))
  text = " ".join(review for review in word_tokenize(data) if review.lower() not in stop_words and review.isalnum())
  return text
In [ ]:
train["text"] = train["text"].apply(remove_stopwords)
test["text"] = test["text"].apply(remove_stopwords)
val["text"] = val["text"].apply(remove_stopwords)
In [ ]:
def vocab_len_size(data, coverage_threshold):

  # Fit a throwaway tokenizer to measure the vocabulary size and the
  # sequence length that covers `coverage_threshold` percent of the texts.
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(data)
  vocab_size = len(tokenizer.word_counts)
  sequences = tokenizer.texts_to_sequences(data)

  seq_lengths = sorted(len(seq) for seq in sequences)
  max_len = np.percentile(seq_lengths, coverage_threshold)

  return int(max_len), vocab_size
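
The coverage-threshold idea in `vocab_len_size` is a percentile cutoff: instead of padding to the longest review, pick the length that covers most sequences and truncate the outliers. A toy illustration (not the notebook's data):

```python
import numpy as np

# Toy sequence lengths: mostly short reviews plus one long outlier
lengths = [4, 5, 5, 6, 6, 7, 8, 9, 12, 40]

# The 90th-percentile length covers ~90% of sequences while ignoring the outlier
max_len = int(np.percentile(sorted(lengths), 90))
print(max_len)  # 14
```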
In [ ]:
max_len, vocab_size = vocab_len_size(train["text"], 98)
oov_token = "<OOV>"
pad_type = "post"
trunc_type = "post"
embedding_size = 300
In [ ]:
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(train["text"])
In [ ]:
def texts_to_sequence(data):
  # Convert texts to integer sequences, then pad/truncate them to max_len.
  text_to_sequence = tokenizer.texts_to_sequences(data)
  padded = pad_sequences(text_to_sequence, maxlen=max_len, padding=pad_type, truncating=trunc_type)

  return padded
In [ ]:
train_padded = texts_to_sequence(train["text"])
test_padded = texts_to_sequence(test["text"])
val_padded = texts_to_sequence(val["text"])
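With `padding="post"` and `truncating="post"`, `pad_sequences` appends zeros up to `max_len` and cuts from the end; a pure-Python sketch of that behaviour:

```python
def pad_post(sequences, maxlen):
    # Post-truncate to maxlen, then post-pad with 0 (the reserved index).
    return [seq[:maxlen] + [0] * (maxlen - len(seq)) for seq in sequences]

print(pad_post([[4, 8, 15], [16, 23, 42, 7, 9]], maxlen=4))
# → [[4, 8, 15, 0], [16, 23, 42, 7]]
```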
In [ ]:
def embedding_vector():

  # Build the embedding matrix: row i holds the Google News vector for the
  # word with tokenizer index i; words absent from Word2Vec stay as zeros.
  embedding_matrix = np.zeros((vocab_size, embedding_size))

  for word, i in tokenizer.word_index.items():
    if i < vocab_size:
      try:
        embedding_matrix[i] = google_news_model[word]
      except KeyError:
        pass  # word not in the pretrained vocabulary; leave as zeros

  return embedding_matrix
In [ ]:
pretrained_vector = embedding_vector()
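The lookup can be illustrated with a toy vocabulary and a hypothetical 3-dimensional "pretrained" dict standing in for the 300-d `google_news_model`:

```python
import numpy as np

# Hypothetical pretrained vectors (stand-in for the Google News model).
toy_vectors = {"good": [0.1, 0.2, 0.3], "bad": [-0.1, -0.2, -0.3]}
toy_word_index = {"<OOV>": 1, "good": 2, "bad": 3, "zorble": 4}  # 1-based, like Keras

vocab, dim = 5, 3
matrix = np.zeros((vocab, dim))
for word, i in toy_word_index.items():
    if i < vocab and word in toy_vectors:  # words without a vector stay all-zero
        matrix[i] = toy_vectors[word]

print(matrix[2])  # pretrained vector for "good"
print(matrix[4])  # "zorble" has no pretrained vector → zeros
```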
In [ ]:
y_train_encoded = to_categorical(train["label"], num_classes=5)
y_val_encoded = to_categorical(val["label"], num_classes=5)
y_test_encoded = to_categorical(test["label"], num_classes=5)
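`to_categorical` just one-hot encodes the integer labels; an equivalent with `np.eye`, assuming labels in 0–4:

```python
import numpy as np

labels = np.array([0, 4, 2])   # example sentiment labels in 0..4
one_hot = np.eye(5)[labels]    # same result as to_categorical(labels, num_classes=5)
print(one_hot)
```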

Word2Vec Pretrained Vector¶

In [ ]:
model1 = Sequential()
model1.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, weights=[pretrained_vector], trainable=False))
model1.add(Bidirectional(LSTM(128, return_sequences=True)))
model1.add(Bidirectional(LSTM(128, return_sequences=True)))
model1.add(Bidirectional(LSTM(128)))
model1.add(Dense(5, activation="softmax"))
model1.compile(loss="categorical_crossentropy", optimizer="adam", metrics=[AUC(multi_label=True, name="val_auc")])
model1.summary()
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ embedding_1 (Embedding)         │ ?                      │     4,411,500 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ bidirectional_3 (Bidirectional) │ ?                      │   0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ bidirectional_4 (Bidirectional) │ ?                      │   0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ bidirectional_5 (Bidirectional) │ ?                      │   0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ ?                      │   0 (unbuilt) │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 4,411,500 (16.83 MB)
 Trainable params: 0 (0.00 B)
 Non-trainable params: 4,411,500 (16.83 MB)
In [ ]:
earlystopping = EarlyStopping(monitor="val_val_auc", mode="max", restore_best_weights=True, patience=5)
model1.fit(train_padded, y_train_encoded, validation_data=(val_padded, y_val_encoded), epochs=10, batch_size= 64, callbacks=[earlystopping])
Epoch 1/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 21s 104ms/step - loss: 1.4480 - val_auc: 0.6660 - val_loss: 1.3473 - val_val_auc: 0.7362
Epoch 2/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 94ms/step - loss: 1.2894 - val_auc: 0.7553 - val_loss: 1.3221 - val_val_auc: 0.7487
Epoch 3/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 94ms/step - loss: 1.2300 - val_auc: 0.7798 - val_loss: 1.3366 - val_val_auc: 0.7514
Epoch 4/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 94ms/step - loss: 1.2085 - val_auc: 0.7913 - val_loss: 1.3198 - val_val_auc: 0.7553
Epoch 5/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 95ms/step - loss: 1.1381 - val_auc: 0.8163 - val_loss: 1.3828 - val_val_auc: 0.7404
Epoch 6/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 95ms/step - loss: 1.0990 - val_auc: 0.8307 - val_loss: 1.3838 - val_val_auc: 0.7428
Epoch 7/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 95ms/step - loss: 1.0170 - val_auc: 0.8594 - val_loss: 1.4620 - val_val_auc: 0.7297
Epoch 8/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 94ms/step - loss: 0.9312 - val_auc: 0.8813 - val_loss: 1.5760 - val_val_auc: 0.7353
Epoch 9/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 94ms/step - loss: 0.8274 - val_auc: 0.9069 - val_loss: 1.5752 - val_val_auc: 0.7239
Out[ ]:
<keras.src.callbacks.history.History at 0x7a082c5db550>
In [ ]:
model1.evaluate(test_padded, y_test_encoded)
70/70 ━━━━━━━━━━━━━━━━━━━━ 2s 26ms/step - loss: 1.2888 - val_auc: 0.7587
Out[ ]:
[1.2863892316818237, 0.7601470947265625]

📈 Observations¶

  • Validation AUC improves until epoch 4, peaking at 0.7553.
  • Training loss decreases steadily, but validation loss rises after epoch 4 → likely overfitting.
  • Non-trainable embeddings: the embedding layer is frozen because it uses the pretrained Word2Vec Google News vectors.
  • Test AUC (0.7601) is slightly higher than the peak validation AUC (0.7553), indicating good generalization.

LLM (DistilBERT) Pretrained Model¶

In [ ]:
tokenizer_dist = AutoTokenizer.from_pretrained("distilbert-base-uncased")
In [ ]:
def tokenize_function(data):
    return tokenizer_dist(data["text"], truncation=True)
In [ ]:
tokenized_dataset = dataset.map(tokenize_function, batched=True)
In [ ]:
data_collator = DataCollatorWithPadding(tokenizer= tokenizer_dist, return_tensors="tf")
In [ ]:
roc_auc = evaluate.load("roc_auc", "multiclass")
In [ ]:
def compute_metrics(eval_pred):

  # Convert logits to class probabilities; multiclass one-vs-rest ROC AUC
  # needs per-class scores, not raw logits.
  preds, labels = eval_pred
  probs = softmax(preds, axis=1)
  return roc_auc.compute(prediction_scores=probs, references=labels, average="macro", multi_class="ovr")
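The `softmax(preds, axis=1)` call turns each row of logits into a probability distribution, which the `ovr` ROC AUC computation expects; a numerically stable NumPy version of the same operation:

```python
import numpy as np

def softmax_rows(logits):
    # Subtract the row max before exponentiating, for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

probs = softmax_rows(np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]))
print(probs.sum(axis=1))  # each row sums to 1
```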
In [ ]:
label_1 = train[["label","label_text"]].drop_duplicates().set_index("label")["label_text"].to_dict()
label_2 = train[["label_text", "label"]].drop_duplicates().set_index("label_text")["label"].to_dict()
In [ ]:
tf.keras.backend.clear_session()
gc.collect()
model2 = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=5, id2label=label_1, label2id=label_2)
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In [ ]:
batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr= 3e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
In [ ]:
tf_train_set = model2.prepare_tf_dataset(tokenized_dataset["train"], shuffle=True, batch_size=16, collate_fn=data_collator)
tf_validation_set = model2.prepare_tf_dataset(tokenized_dataset["validation"], shuffle=False, batch_size=16, collate_fn=data_collator)
tf_test_set = model2.prepare_tf_dataset(tokenized_dataset["test"], shuffle=False, batch_size=16, collate_fn=data_collator)
In [ ]:
model2.compile(optimizer=optimizer)
In [ ]:
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
callbacks = [metric_callback]
In [ ]:
model2.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=5, callbacks=callbacks)
Epoch 1/5
534/534 [==============================] - 79s 116ms/step - loss: 1.2494 - val_loss: 1.1887 - roc_auc: 0.8105
Epoch 2/5
534/534 [==============================] - 50s 93ms/step - loss: 0.9486 - val_loss: 1.1901 - roc_auc: 0.8171
Epoch 3/5
534/534 [==============================] - 49s 93ms/step - loss: 0.6869 - val_loss: 1.3317 - roc_auc: 0.8154
Epoch 4/5
534/534 [==============================] - 50s 93ms/step - loss: 0.4683 - val_loss: 1.5756 - roc_auc: 0.8113
Epoch 5/5
534/534 [==============================] - 49s 92ms/step - loss: 0.3173 - val_loss: 1.6810 - roc_auc: 0.8088
Out[ ]:
<tf_keras.src.callbacks.History at 0x7a33326f82d0>
In [ ]:
test_inputs_only = tf_test_set.map(lambda x, y: x)
test_label_only = tf_test_set.map(lambda x, y: y)
batches = [y.numpy() for y in test_label_only]
y_true = np.concatenate(batches, axis=0)
In [ ]:
result = softmax(model2.predict(test_inputs_only)["logits"], axis=1)
139/139 [==============================] - 4s 29ms/step
In [ ]:
roc_auc.compute(prediction_scores = result, references=y_true, average="macro", multi_class="ovr")
Out[ ]:
{'roc_auc': 0.8277722980808679}
In [ ]:
pipe = pipeline("text-classification", model=model2, tokenizer=tokenizer_dist)
Device set to use 0
In [ ]:
print("*******************Test Run******************************")
print(pipe(test["text"][0]))
print(f'Actual text: {test["text"][0]}, Actual Class: {test["label_text"][0]}')
*******************Test Run******************************
[{'label': 'negative', 'score': 0.8883830904960632}]
Actual text: no movement , no yuks , not much of anything ., Actual Class: negative

📈 Observations¶

  • Best validation ROC AUC observed at epoch 2 (0.8171).
  • Overfitting begins after epoch 2: validation loss climbs while validation ROC AUC stays relatively stable.
  • The final test ROC AUC of 0.8278 indicates strong generalization despite the rising validation loss.

Model Comparison: Word2Vec vs. DistilBERT¶


🧱 Model Architecture & Embeddings¶

| Feature | Model 1: Word2Vec | Model 2: DistilBERT |
|---|---|---|
| Embedding Type | Static pretrained Word2Vec | Contextual DistilBERT (transformer-based) |
| Trainable Embeddings | ❌ No (frozen) | ✅ Yes (fully fine-tuned) |
| Model Architecture | Embedding → BiLSTM (×3) → Dense | DistilBERT → Dense |
| Trainable Params | BiLSTM and Dense layers only (embedding frozen) | Fully trainable |
| Total Params | ~4.4M in the frozen embedding, plus the LSTM/Dense layers | ~66M |

📊 Training & Validation Performance¶

| Metric | Model 1: Word2Vec | Model 2: DistilBERT |
|---|---|---|
| Best Val AUC | 0.7553 (epoch 4) | 0.8171 (epoch 2) |
| Final Test AUC | 0.7601 | 0.8278 (macro, full set) |
| Overfitting Point | After epoch 4 | After epoch 2 |
| Val Loss Trend | Increases after epoch 4 | Increases after epoch 2 |
| Train Loss Trend | Smooth decrease | Rapid decrease |
| Training Time | ~13 s per epoch | ~49–79 s per epoch |

📈 Performance Summary¶

  • Model 2 (DistilBERT) generalizes significantly better, with a final macro test AUC of 0.8278 versus 0.7601 for Model 1.
  • Model 1 is capped by its frozen, context-free embeddings and lower capacity; Model 2 overfits slightly after 2 epochs but still generalizes better.
  • Word2Vec embeddings are static and miss contextual nuance, while DistilBERT encodes meaning dynamically based on context.

✅ Thoughts¶

  • Prefer DistilBERT (Model 2) when computational resources allow, especially for nuanced or context-rich text classification tasks.
  • Consider the lighter Word2Vec-based model, possibly with the embedding layer unfrozen for fine-tuning, when compute or model size is constrained.

📝 Conclusion: DistilBERT outperforms Word2Vec in both AUC and learning dynamics, thanks to richer embeddings and a more expressive architecture.