K-Means algorithm to cluster 1 million news headlines

1 Dataset Information Sources

News headline data published by the Australian Broadcasting Corporation (ABC).

Import related modules:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline

Read the dataset (the download link is at the bottom of the article):

# Read the dataset (note: in newer pandas, error_bad_lines has been replaced by on_bad_lines)
data = pd.read_csv("C:/Users/86135/AI/Lesson2/output_2021-08-26-14_37_47/abcnews-date-text(1).csv",
                   error_bad_lines=False, usecols=["headline_text"])
data.head()
# Take a subset so the code runs quickly; more data may improve the clusters, but training takes longer.
data = data.head(10000)
# print(data.info())  # print dataset information

1.1 Deduplication

Duplicate rows can be inspected with pandas.DataFrame.duplicated. For details, see the documentation for DataFrame.duplicated().

# View duplicated headline rows
data[data['headline_text'].duplicated(keep=False)].sort_values('headline_text').head(8)

Duplicate rows can be removed with pandas.DataFrame.drop_duplicates. For details, see the documentation for DataFrame.drop_duplicates().

# Remove duplicate rows
data = data.drop_duplicates('headline_text')

2 Data preprocessing

2.1 Preprocessing for vectorized representation

When doing natural language processing, words must be converted into vectors that machine learning algorithms can use. If the goal is to build machine learning models on textual data, such as movie reviews or tweets or anything else, the text has to be converted to numbers first. This process is called "embedding" or "vectorization".
When doing vectorization, it is important to remember that it is not just about turning a single word into a single number. Words can be converted to numbers, and entire documents can be converted to vectors. These vectors usually have more than one dimension, and for textual data they tend to be high-dimensional, because each dimension of the feature space corresponds to a word and the documents we deal with often contain thousands of distinct words.
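As a minimal sketch of what vectorization produces (using two made-up example headlines rather than the ABC data), the TfidfVectorizer used later in this article turns each document into one row of a numeric matrix, with one column per vocabulary word. Depending on your scikit-learn version, the feature-name call may be get_feature_names() (as used later in this article) instead of get_feature_names_out().

from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up example "documents" (not taken from the ABC dataset)
docs = ["rain eases in drought affected areas",
        "police investigate drought relief funding"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)           # sparse matrix: one row per document
print(vec.get_feature_names_out())    # one column per vocabulary word
print(X.toarray().round(2))           # each document is now a numeric vector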

2.2 TF-IDF

In information retrieval, tf–idf or TFIDF (term frequency–inverse document frequency) is a numerical statistic designed to reflect the importance of words to documents in a corpus. It is often used as a weighting factor in searches for information retrieval, text mining, and user modeling. The tf-idf value is proportional to the number of times the word appears in the document, while being offset by the frequency of the word in the corpus, which helps to adjust for the fact that certain words generally appear more frequently. Today, tf-idf is one of the most popular term weighting schemes. In the digital library domain, 83% of text-based recommender systems use tf-idf.

Search engines often use a variant of the tf–idf weighting scheme as their primary tool for scoring and ranking documents for relevance given a user query. tf–idf can be successfully used for stopword filtering in various domains, including text summarization and classification.

The simplest of the ranking functions is derived by adding the tf–idf for each query term, and many more complex ranking functions are variations of this simple model.

Given the TF (term frequency) and IDF (inverse document frequency) of a word, multiplying the two yields its TF-IDF value. The larger a word's TF-IDF in an article, the more important that word generally is to the article. So by computing the TF-IDF of every word in an article and sorting the values from largest to smallest, the top-ranked words can be taken as the article's keywords.

tf-idf algorithm steps:

  • Calculate the term frequency (TF):
    Let p be the number of times the word appears in the article and n the total number of words in the article.
    Normalized term frequency: tf = p / n

  • Calculate the inverse document frequency (IDF):
    This step requires a corpus to simulate the environment in which the language is used.
    Inverse document frequency: idf = log(total number of documents in the corpus / (number of documents containing the word + 1))

    The more common a word is, the larger the denominator and the smaller the IDF, approaching 0. The +1 in the denominator prevents division by zero when no document contains the word.

  • Calculate TF-IDF:
    tf-idf = tf × idf (a small worked example is shown below)
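As a rough illustration of these formulas, here is a toy example with made-up documents, using the natural log and the +1 smoothing described above. Note that scikit-learn's TfidfVectorizer uses a slightly different smoothed IDF and L2-normalizes each row, so its numbers will not match this exactly.

import math

# Toy corpus of four made-up "documents"
docs = [
    "rain eases drought",
    "rain hits the coast",
    "council plans new road",
    "police investigate crash",
]

word = "rain"
doc = docs[0].split()

tf = doc.count(word) / len(doc)                          # p / n = 1 / 3
docs_containing = sum(word in d.split() for d in docs)   # 2 documents contain "rain"
idf = math.log(len(docs) / (docs_containing + 1))        # log(4 / 3)
print(tf, idf, tf * idf)                                 # roughly 0.333, 0.288, 0.096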

Advantages and disadvantages of TF-IDF: the advantages are that it is simple, fast, and easy to understand. The disadvantage is that measuring a word's importance by frequency alone is not always adequate: important words may not appear often, and the calculation reflects neither a word's position nor its importance within its context. If you need to capture the contextual structure of words, an algorithm such as word2vec may be required.

punc = ['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}', '%']
stop_words = text.ENGLISH_STOP_WORDS.union(punc)
desc = data['headline_text'].values
# print(desc)
vectorizer = TfidfVectorizer(stop_words=stop_words)  # instantiate the vectorizer
# print(vectorizer)
X = vectorizer.fit_transform(desc)  # learn the vocabulary and compute TF-IDF weights
# print(X)
word_features = vectorizer.get_feature_names()  # all keywords in the vocabulary (get_feature_names_out() in newer scikit-learn)
# print(word_features)
print(len(word_features))
print(word_features[0:50])

2.3 Stemming

Stemming is the process of reducing a word to its stem (i.e. root form). The stem is not necessarily a word itself; the original word can be rebuilt from it by attaching the right suffixes. For example, the words "fish", "fishes" and "fishing" all reduce to the stem "fish", which happens to be a valid word. On the other hand, the words "study", "studies" and "studying" all reduce to "studi", which is not a proper English word.
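A minimal check of this behaviour with NLTK's SnowballStemmer (the stemmer used later in this article):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
for word in ["fish", "fishes", "fishing", "study", "studies", "studying"]:
    print(word, "->", stemmer.stem(word))
# "fish", "fishes", "fishing" all reduce to "fish";
# "study", "studies", "studying" all reduce to "studi"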

2.4 Tokenizing

Tokenization breaks sentences into words and punctuation

stemmer = SnowballStemmer('english')
tokenizer = RegexpTokenizer(r'[a-zA-Z\']+')  # split sentences according to the regular expression
wordslist = tokenizer.tokenize(desc[0])
'''
print(wordslist)
for word in wordslist:
    print(stemmer.stem(word))
'''
def tokenize(text):
    return [stemmer.stem(word) for word in tokenizer.tokenize(text.lower())]

2.5 TF-IDF vectorization with stopwords, stemming and custom tokenizing

vectorizer2 = TfidfVectorizer(stop_words=stop_words, tokenizer=tokenize)
# print(vectorizer2)
X2 = vectorizer2.fit_transform(desc)
word_features2 = vectorizer2.get_feature_names()
print(len(word_features2))
print(word_features2[:50])

vectorizer3 = TfidfVectorizer(stop_words=stop_words, tokenizer=tokenize, max_features=1000)
X3 = vectorizer3.fit_transform(desc)
words = vectorizer3.get_feature_names()
print(len(words))
print(words[:50])

3 K-Means Clustering

3.1 Using the elbow method to select the number of clusters

As the number of clusters k increases, the samples are divided more finely and each cluster becomes more tightly aggregated, so the sum of squared errors (SSE) naturally decreases. When k is smaller than the true number of clusters, increasing k greatly improves the aggregation of each cluster, so SSE drops sharply. Once k reaches the true number of clusters, the gain in aggregation from further increases in k falls off quickly, so the decrease in SSE flattens out as k keeps growing. In other words, the plot of SSE against k looks like an elbow, and the k value at the elbow is the true number of clusters in the data. This is why the method is called the elbow method.

from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X3)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squared errors (SSE)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.savefig('elbow.png')
plt.show()

The resulting elbow curve (WCSS plotted against the number of clusters) is shown in the figure.
Since the curve can show more than one elbow, the number of clusters sometimes has to be chosen by trial and error. Below, the clustering results for several different numbers of clusters are compared in order to find a suitable value.

def MyKMeans(clusters, iters, jobs, datas):
    # Note: the n_jobs parameter was removed from KMeans in newer scikit-learn versions
    kmeans = KMeans(n_clusters=clusters, n_init=iters, n_jobs=jobs)
    kmeans.fit(datas)
    # argsort usage: https://numpy.org/doc/stable/reference/generated/numpy.argsort.html
    # For each cluster centre, take the indices of the 25 highest-weighted terms
    common_words = kmeans.cluster_centers_.argsort()[:, -1:-26:-1]
    for num, centroid in enumerate(common_words):
        print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))

3.2 Clusters equal to 3

MyKMeans(3, 20, 1, X3)

The output is as follows:

0 : iraq, plan, govt, new, man, win, say, council, iraqi, claim, charg, warn, report, baghdad, kill, fund, urg, world, water, court, face, nsw, troop, rain, death
1 : polic, probe, man, arrest, search, death, investig, murder, charg, drug, stab, wa, cannabi, station, fatal, car, miss, victim, protest, road, suspect, driver, nt, corrupt, new
2 : war, protest, anti, iraq, howard, ralli, pm, post, say, plan, student, fear, condemn, iraqi, bush, market, thousand, march, downer, warn, deni, start, end, stage, peac

3.3 Clusters equal to 5

MyKMeans(5, 20, 1, X3)

The output is as follows:

0 : polic, man, govt, win, new, council, charg, say, claim, warn, court, report, fund, face, death, baghdad, world, kill, urg, nsw, rain, set, crash, water, cup
1 : iraqi, diplomat, forc, baghdad, expel, coalit, marin, kill, missil, say, war, civilian, bomb, saddam, claim, surrend, suicid, refuge, troop, attack, border, aid, basra, weapon, tv
2 : plan, water, shire, council, park, new, manag, protest, govt, firm, green, begin, group, m, welcom, merger, defend, health, rail, land, farmer, station, burn, concern, union
3 : iraq, war, say, missil, troop, howard, deni, post, bush, destroy, blair, pm, report, bomb, british, attack, forc, kill, turkey, aid, warn, tv, resolut, blix, uk
4 : war, protest, anti, howard, ralli, pm, student, thousand, march, fear, street, condemn, peac, open, say, arrest, melbourn, market, downer, day, warn, start, nz, polic, hous

3.4 Clusters equal to 8

MyKMeans(8, 20, 1, X3)

The output is as follows:

0 : plan, govt, council, iraqi, say, claim, warn, report, baghdad, world, fund, urg, kill, nsw, cup, water, set, crash, troop, lead, final, meet, death, ban, continue
1 : war, protest, anti, howard, ralli, pm, student, plan, iraqi, thousand, fear, say, march, condemn, crean, melbourn, street, market, day, gulf, warn, start, oil, open, stage
2 : polic, man, charg, murder, face, court, probe, stab, death, arrest, search, jail, car, drug, miss, fatal, assault, investig, accid, crash, station, cannabi, sex, wa, attack
3 : iraq, war, say, missil, troop, deni, post, blair, bush, destroy, howard, pm, report, bomb, british, attack, forc, kill, turkey, aid, warn, tv, blix, uk, kuwait
4 : concern, air, aust, strike, rise, toll, worker, pay, death, qld, council, job, baghdad, govt, iraqi, market, teacher, rate, open, cut, saddam, water, group, troop, nz
5 : win, lead, season, fan, m, goal, india, award, open, hope, tiger, championship, world, gold, return, titl, thriller, cup, coast, stage, streak, best, case, celebr, waratah
6 : new, resolut, plan, hope, appoint, ceo, presid, work, open, look, polic, hit, law, high, rate, compani, govt, wa, hospit, servic, iraq, coach, set, board, tas
7 : rain, drought, farmer, water, relief, help, bring, offer, need, qld, boost, fund, restrict, end, toll, break, hope, affect, despit, eas, welcom, impact, nsw, flood, fall

Finally, based on the results produced with different numbers of clusters, we can judge manually which setting clusters the headlines best.
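As a minimal sketch of one way to do this inspection (not part of the original code, and assuming the vectorized matrix X3 and the deduplicated headline array desc defined earlier), one can refit the model with a chosen number of clusters and print a few raw headlines from each cluster to eyeball the grouping:

k = 8  # assumed choice; try the other candidate values as well
kmeans = KMeans(n_clusters=k, n_init=20, random_state=0)
labels = kmeans.fit_predict(X3)   # cluster label for each headline

for cluster in range(k):
    sample = desc[labels == cluster][:5]   # first five headlines in this cluster
    print(f"Cluster {cluster}:")
    for headline in sample:
        print("   ", headline)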

Dataset:
Link: https://pan.baidu.com/s/1A2eyF7QdoFf0H5Gv92gMaA
Extraction code: 7bk7
