!pip install --upgrade --force-reinstall --no-cache-dir numpy pandas gensim datasets evaluate
[pip install log truncated. numpy 1.26.4, pandas 2.3.0, gensim 4.3.3, datasets 3.6.0 and evaluate 0.4.4 were installed successfully along with their dependencies. pip's resolver reported version conflicts with packages preinstalled in Colab (google-colab, cudf-cu12, torch's CUDA libraries, etc.), which is expected after a force-reinstall.]
import pandas as pd
from datasets import load_dataset
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import gensim.downloader as api
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences, to_categorical
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense
from tensorflow.keras import Sequential
from tensorflow.keras.metrics import AUC, SparseCategoricalAccuracy
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, DataCollatorWithPadding, create_optimizer, pipeline
from datasets import Dataset
import evaluate
from transformers.keras_callbacks import KerasMetricCallback
import gc
from scipy.special import softmax
nltk.download("stopwords")
nltk.download("punkt_tab")
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
True
dataset = load_dataset("SetFit/sst5")
[Output truncated: a Hugging Face Hub warning that no HF_TOKEN Colab secret is set (authentication is optional for public datasets), a "Repo card metadata block was not found" warning, and download/generation progress bars for README.md, train.jsonl (1.32 MB), dev.jsonl (171 kB) and test.jsonl (343 kB), yielding 8,544 train, 1,101 validation and 2,210 test examples.]
train = pd.DataFrame(dataset["train"])
test = pd.DataFrame(dataset["test"])
val = pd.DataFrame(dataset["validation"])
def initial_analysis(data):
    print(f"************1st five in the dataset************ \n {data.head()}")
    print(f"************Summary Stat************ \n {data.describe()}")
    print(f"************Count of missing Values************\n {data.isnull().sum()}")
    print(f"************Dataset shape************\n {data.shape}")
    print(f"************Duplicated rows count************ \n {data.duplicated().sum()}")
    print(f'************Unique label count************ \n {round(data["label_text"].value_counts(normalize=True)*100, 2)}')
initial_analysis(train)
************1st five in the dataset************ 
                                                text  label     label_text
0  a stirring , funny and finally transporting re...      4  very positive
1  apparently reassembled from the cutting-room f...      1       negative
2  they presume their audience wo n't sit still f...      1       negative
3  the entire movie is filled with deja vu moments .      2        neutral
4  this is a visually stunning rumination on love...      3       positive
************Summary Stat************ 
             label
count  8544.000000
mean      2.058052
std       1.281570
min       0.000000
25%       1.000000
50%       2.000000
75%       3.000000
max       4.000000
************Count of missing Values************
text          0
label         0
label_text    0
dtype: int64
************Dataset shape************
(8544, 3)
************Duplicated rows count************ 
10
************Unique label count************ 
label_text
positive         27.18
negative         25.96
neutral          19.01
very positive    15.07
very negative    12.78
Name: proportion, dtype: float64
def nlp_initial_analysis(data):
    stop_words = set(stopwords.words("english"))
    stop_words.update(["movie", "film", "rrb", "lrb"])
    freqdist = " ".join(r.lower() for text in data["text"] for r in word_tokenize(text) if r not in stop_words and r.isalnum())
    freqdist = FreqDist(freqdist.split())
    top10_words = freqdist.most_common(10)
    word, count = zip(*top10_words)
    print(f"Total number of unique words are {len(freqdist)}, and total number of words are {sum(freqdist.values())}")
    plt.figure(figsize=(12,8))
    plt.bar(word, count)
    plt.title("Top 10 most frequent words")
    plt.show()
    reviews_label = list(data["label_text"].value_counts().index)
    for i in reviews_label:
        review_type = data[data["label_text"] == i]
        plt.figure(figsize=(12,8))
        text = " ".join(reviews.lower() for reviews in review_type["text"])
        wordcloud = WordCloud(stopwords=stop_words).generate(text)
        plt.title(f"WordCloud for {i}")
        plt.imshow(wordcloud)
        plt.show()
    plt.pie(data["label_text"].value_counts(normalize = True), labels= data["label_text"].value_counts(normalize = True).index, autopct= "%1.1f%%")
    plt.title("Classification percentage")
    plt.show()
nlp_initial_analysis(train)
Total number of unique words are 14703, and total number of words are 74949
The label summary above indicates that the dataset covers all five sentiment levels with a relatively balanced spread.
No missing values were found in the dataset:
Column | Missing Count |
---|---|
text | 0 |
label | 0 |
label_text | 0 |
This confirms the dataset is clean and ready for further processing.
The distribution of sentiment classes based on the label_text column:
Label Text | Proportion (%) |
---|---|
Positive | 27.18% |
Negative | 25.96% |
Neutral | 19.01% |
Very Positive | 15.07% |
Very Negative | 12.78% |
From this, we observe that the moderate classes (positive, negative) together make up just over half of the reviews, while the extreme classes (very positive, very negative) are the least frequent; even the rarest class still accounts for nearly 13% of the data.
📌 Conclusion:
The dataset is clean, fairly balanced, and ready for exploratory data analysis (EDA), text preprocessing, and sentiment modeling tasks.
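Since the initial analysis flagged 10 duplicated rows, an optional cleanup step could drop them before modelling. The sketch below is illustrative only; the rest of the notebook keeps the duplicates:

```python
# Optional cleanup sketch: drop the 10 duplicated rows flagged by initial_analysis.
# The remaining cells in this notebook keep the duplicates, so this is not applied.
train_dedup = train.drop_duplicates().reset_index(drop=True)
print(f"Rows before: {len(train)}, after dropping duplicates: {len(train_dedup)}")
```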
[Figures omitted: the top-10 frequent-words bar chart, a word cloud per sentiment class, and the label-distribution pie chart. The prominent words, grouped as they appeared, were:]

- one, like, story, time, character, make, little
- like, character, bad, even, story, work, little, hard, feel
- like, one, work, make, good, story, character, performance, love
- bad, even, dull, minute, character, story, plot, worst, nothing, seem
- performance, best, funny, work, make, love, story, character, year, comedy
Across all sentiments, common thematic words include "story", "character", and "like", which reflects their central importance in reviews regardless of polarity. The sentiment-specific modifiers (e.g., bad, love, hard, funny) help distinguish the emotional direction of each review.
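Because the word clouds are images, the per-sentiment rankings are hard to quote precisely; the sketch below reproduces them textually using the same stop-word handling as nlp_initial_analysis (run it before the remove_stopwords step further down, while train["text"] still holds the raw reviews):

```python
# Sketch: top-10 tokens per sentiment class, mirroring the word-cloud content as text.
stop_words = set(stopwords.words("english"))
stop_words.update(["movie", "film", "rrb", "lrb"])

for label in train["label_text"].value_counts().index:
    tokens = [t.lower()
              for review in train.loc[train["label_text"] == label, "text"]
              for t in word_tokenize(review)
              if t.lower() not in stop_words and t.isalnum()]
    print(label, FreqDist(tokens).most_common(10))
```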
google_news_model = api.load("word2vec-google-news-300")
[==================================================] 100.0% 1662.8/1662.8MB downloaded
def remove_stopwords(data):
    stop_words = set(stopwords.words("english"))
    text = " ".join(review for review in word_tokenize(data) if review not in stop_words and review.isalnum())
    return text
train["text"] = train["text"].apply(remove_stopwords)
test["text"] = test["text"].apply(remove_stopwords)
val["text"] = val["text"].apply(remove_stopwords)
def vocab_len_size(data, coverage_threshold):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(data)
    vocab_size = len(tokenizer.word_counts)
    sequences = tokenizer.texts_to_sequences(data)
    seq_length = sorted([len(seq) for seq in sequences])
    max_len = np.percentile(seq_length, coverage_threshold)
    return int(max_len), vocab_size
max_len, vocab_size = vocab_len_size(train["text"], 98)
oov_token = "<OOV>"
pad_type = "post"
trunc_type = "post"
embedding_size = 300
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(train["text"])
def texts_to_sequence(data):
    text_to_sequence = tokenizer.texts_to_sequences(data)
    padded = pad_sequences(text_to_sequence, maxlen=max_len, padding=pad_type, truncating=trunc_type)
    return padded
train_padded = texts_to_sequence(train["text"])
test_padded = texts_to_sequence(test["text"])
val_padded = texts_to_sequence(val["text"])
def embedding_vector():
    embedding_matrix = np.zeros((vocab_size, embedding_size))
    for word, i in tokenizer.word_index.items():
        if i < vocab_size:
            try:
                embedding_matrix[i] = google_news_model[word]
            except KeyError:
                pass
    return embedding_matrix
pretrained_vector = embedding_vector()
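Words that are missing from the Google News vocabulary keep an all-zero row in the matrix. A quick coverage check (a small sketch using the tokenizer and model objects defined above) shows how much of the vocabulary actually receives a pretrained vector:

```python
# Sketch: fraction of the tokenizer vocabulary covered by pretrained Word2Vec vectors.
covered = sum(1 for word, i in tokenizer.word_index.items()
              if i < vocab_size and word in google_news_model)
print(f"{covered}/{vocab_size} vocabulary entries matched "
      f"({covered / vocab_size:.1%}); the rest stay zero-initialised.")
```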
y_train_encoded = to_categorical(train["label"], num_classes =5)
y_val_encoded = to_categorical(val["label"], num_classes =5)
y_test_encoded = to_categorical(test["label"], num_classes =5)
model1 = Sequential()
model1.add(Embedding(input_dim = vocab_size, output_dim= embedding_size, weights= [pretrained_vector], trainable = False))
model1.add(Bidirectional(LSTM(128, return_sequences=True)))
model1.add(Bidirectional(LSTM(128, return_sequences=True)))
model1.add(Bidirectional(LSTM(128)))
model1.add(Dense(5, activation="softmax"))
# Note: naming the AUC metric "val_auc" makes Keras report its validation copy as "val_val_auc" in the logs below.
model1.compile(loss = "categorical_crossentropy", optimizer= "adam", metrics=[AUC(multi_label=True, name= "val_auc")])
model1.summary()
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ embedding_1 (Embedding)         │ ?                      │     4,411,500 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ bidirectional_3 (Bidirectional) │ ?                      │   0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ bidirectional_4 (Bidirectional) │ ?                      │   0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ bidirectional_5 (Bidirectional) │ ?                      │   0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ ?                      │   0 (unbuilt) │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 4,411,500 (16.83 MB)
Trainable params: 0 (0.00 B)
Non-trainable params: 4,411,500 (16.83 MB)

(The BiLSTM and Dense layers show as "0 (unbuilt)" because the model has not yet been built with an input shape; their weights are created, and trained, once fit is called. Only the frozen embedding matrix is counted at this point.)
earlystopping = EarlyStopping(monitor="val_val_auc", mode="max", restore_best_weights=True, patience=5)
model1.fit(train_padded, y_train_encoded, validation_data=(val_padded, y_val_encoded), epochs=10, batch_size= 64, callbacks=[earlystopping])
Epoch 1/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 21s 104ms/step - loss: 1.4480 - val_auc: 0.6660 - val_loss: 1.3473 - val_val_auc: 0.7362
Epoch 2/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 94ms/step - loss: 1.2894 - val_auc: 0.7553 - val_loss: 1.3221 - val_val_auc: 0.7487
Epoch 3/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 94ms/step - loss: 1.2300 - val_auc: 0.7798 - val_loss: 1.3366 - val_val_auc: 0.7514
Epoch 4/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 94ms/step - loss: 1.2085 - val_auc: 0.7913 - val_loss: 1.3198 - val_val_auc: 0.7553
Epoch 5/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 95ms/step - loss: 1.1381 - val_auc: 0.8163 - val_loss: 1.3828 - val_val_auc: 0.7404
Epoch 6/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 95ms/step - loss: 1.0990 - val_auc: 0.8307 - val_loss: 1.3838 - val_val_auc: 0.7428
Epoch 7/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 95ms/step - loss: 1.0170 - val_auc: 0.8594 - val_loss: 1.4620 - val_val_auc: 0.7297
Epoch 8/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 94ms/step - loss: 0.9312 - val_auc: 0.8813 - val_loss: 1.5760 - val_val_auc: 0.7353
Epoch 9/10
134/134 ━━━━━━━━━━━━━━━━━━━━ 13s 94ms/step - loss: 0.8274 - val_auc: 0.9069 - val_loss: 1.5752 - val_val_auc: 0.7239
<keras.src.callbacks.history.History at 0x7a082c5db550>
model1.evaluate(test_padded, y_test_encoded)
70/70 ━━━━━━━━━━━━━━━━━━━━ 2s 26ms/step - loss: 1.2888 - val_auc: 0.7587
[1.2863892316818237, 0.7601470947265625]
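The Keras AUC above is a multi-label average, which is not directly comparable to the macro ROC AUC reported for the transformer model later on. A small sketch (loading the same evaluate metric that is used further down) puts model1 on the same scale:

```python
# Sketch: macro one-vs-rest ROC AUC for model1 on the test set, for a like-for-like
# comparison with the DistilBERT result reported later in the notebook.
roc_auc_metric = evaluate.load("roc_auc", "multiclass")
test_probs = model1.predict(test_padded)  # softmax outputs, shape (n_samples, 5)
print(roc_auc_metric.compute(prediction_scores=test_probs,
                             references=test["label"],
                             average="macro", multi_class="ovr"))
```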
tokenizer_dist = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_function(data):
    return tokenizer_dist(data["text"], truncation=True)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer= tokenizer_dist, return_tensors="tf")
roc_auc = evaluate.load("roc_auc", "multiclass")
def compute_metrics(eval_pred):
    preds, label = eval_pred
    probs = softmax(preds, axis=1)
    return roc_auc.compute(prediction_scores=probs, references=label, average="macro", multi_class="ovr")
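KerasMetricCallback will call compute_metrics with a (predictions, labels) tuple gathered from the validation set. A tiny smoke test with dummy logits (a sketch, not part of the training run) confirms the expected shapes and output format:

```python
# Sketch: smoke-test compute_metrics with dummy logits for 5 examples x 5 classes.
dummy_logits = np.random.randn(5, 5)
dummy_labels = np.array([0, 1, 2, 3, 4])  # every class present at least once
print(compute_metrics((dummy_logits, dummy_labels)))  # e.g. {'roc_auc': ...}
```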
label_1 = train[["label","label_text"]].drop_duplicates().set_index("label")["label_text"].to_dict()
label_2 = train[["label_text", "label"]].drop_duplicates().set_index("label_text")["label"].to_dict()
tf.keras.backend.clear_session()
gc.collect()
model2 = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=5, id2label=label_1, label2id=label_2)
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr= 3e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
tf_train_set = model2.prepare_tf_dataset(tokenized_dataset["train"], shuffle=True, batch_size=16, collate_fn=data_collator)
tf_validation_set = model2.prepare_tf_dataset(tokenized_dataset["validation"], shuffle=False, batch_size=16, collate_fn=data_collator)
tf_test_set = model2.prepare_tf_dataset(tokenized_dataset["test"], shuffle=False, batch_size=16, collate_fn=data_collator)
model2.compile(optimizer=optimizer)
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
callbacks = [metric_callback]
model2.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=5, callbacks=callbacks)
Epoch 1/5
534/534 [==============================] - 79s 116ms/step - loss: 1.2494 - val_loss: 1.1887 - roc_auc: 0.8105
Epoch 2/5
534/534 [==============================] - 50s 93ms/step - loss: 0.9486 - val_loss: 1.1901 - roc_auc: 0.8171
Epoch 3/5
534/534 [==============================] - 49s 93ms/step - loss: 0.6869 - val_loss: 1.3317 - roc_auc: 0.8154
Epoch 4/5
534/534 [==============================] - 50s 93ms/step - loss: 0.4683 - val_loss: 1.5756 - roc_auc: 0.8113
Epoch 5/5
534/534 [==============================] - 49s 92ms/step - loss: 0.3173 - val_loss: 1.6810 - roc_auc: 0.8088
<tf_keras.src.callbacks.History at 0x7a33326f82d0>
test_inputs_only = tf_test_set.map(lambda x, y: x)
test_label_only = tf_test_set.map(lambda x, y: y)
batches = [y.numpy() for y in test_label_only]
y_true = np.concatenate(batches, axis=0)
result = softmax(model2.predict(test_inputs_only)["logits"], axis=1)
139/139 [==============================] - 4s 29ms/step
roc_auc.compute(prediction_scores = result, references=y_true, average="macro", multi_class="ovr")
{'roc_auc': 0.8277722980808679}
pipe = pipeline("text-classification", model=model2, tokenizer=tokenizer_dist)
Device set to use 0
print("*******************Test Run******************************")
print(pipe(test["text"][0]))
print(f'Actual text: {test["text"][0]}, Actual Class: {test["label_text"][0]}')
*******************Test Run******************************
[{'label': 'negative', 'score': 0.8883830904960632}]
Actual text: no movement , no yuks , not much of anything ., Actual Class: negative
Feature | Model 1: Word2Vec | Model 2: DistilBERT |
---|---|---|
Embedding Type | Static pretrained Word2Vec | Transformer-based DistilBERT (contextual) |
Trainable Embeddings | ❌ No (frozen) | ✅ Yes (fully fine-tuned) |
Model Architecture | Embedding → BiLSTM (x3) → Dense | DistilBERT → Dense |
Trainable Params | ~1.2M (BiLSTM + Dense; embeddings frozen) | ~67M (all layers fine-tuned) |
Total Params | ~5.6M (4.4M frozen embedding + ~1.2M trainable) | ~67M |
Metric | Model 1: Word2Vec | Model 2: DistilBERT |
---|---|---|
Best Val AUC | 0.7553 (Epoch 4) | 0.8171 (Epoch 2) |
Final Test AUC | 0.7601 (Keras multi-label AUC) | 0.8278 (macro OvR ROC AUC, full test set) |
Overfitting Point | After Epoch 4 | After Epoch 2 |
Val Loss Trend | Increases after Epoch 4 | Increases after Epoch 2 |
Train Loss Trend | Smooth decrease | Rapid decrease |
Training Time | ~13s per epoch | ~49–79s per epoch |
📝 Conclusion: DistilBERT outperforms Word2Vec in both AUC and learning dynamics, thanks to richer embeddings and a more expressive architecture.
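The parameter counts quoted in the comparison are approximate; they can be checked directly from the fitted models (a quick sketch, assuming both models are still in memory and have been built by training):

```python
# Sketch: verify the parameter counts used in the comparison table.
print("Model 1 (Word2Vec + BiLSTM) parameters:", f"{model1.count_params():,}")
print("Model 2 (DistilBERT) parameters:", f"{model2.count_params():,}")
```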