The Application of User Behavior Analysis by Improved K-means Algorithm Based on Hadoop

Gang Zhao

Abstract


In recent years, social networks and mobile Internet technology have driven rapid growth in the number of network users, and network data has grown explosively. Analyzing and extracting user behavior characteristics from this mass of data has become increasingly important. Many clustering algorithms are used for user behavior analysis, and K-means is one of the most common. However, K-means has several disadvantages that reduce clustering performance: the number of clusters and the initial cluster centers must be specified in advance, it is sensitive to abnormal data, and it can only handle numerical data. Based on a study of Hadoop and clustering algorithms, we propose a K-means clustering method based on the Canopy method. Finally, we compare the results of the cluster experiment with those of a single-machine experiment. The experimental results show that the K-means clustering algorithm based on Canopy is faster than the original K-means algorithm, which indicates that the algorithm scales better and has high application value.
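The idea of seeding K-means with Canopy can be illustrated with a minimal single-machine sketch. This is not the authors' Hadoop implementation: the function names (`canopy_centers`, `kmeans`), the loose and tight thresholds `t1` and `t2`, and the sample points are all assumptions chosen for illustration. The sketch shows how a cheap Canopy pass can supply both the number of clusters and the initial centers, which plain K-means would otherwise require as input.

```python
import math


def dist(a, b):
    """Euclidean distance between two numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def canopy_centers(points, t1, t2):
    """Return one center per canopy (hypothetical helper).

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (points this close to a seed cannot seed another canopy).
    """
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        # Every point within t1 of the seed belongs to this canopy.
        members = [p for p in points if dist(p, seed) <= t1]
        centers.append(mean(members))
        # Drop the seed and all points within t2 from the candidate list.
        remaining = [p for p in remaining if dist(p, seed) > t2]
    return centers


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iterations starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[idx].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    # Made-up sample data: two well-separated groups.
    data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
            (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
    seeds = canopy_centers(data, t1=3.0, t2=1.5)
    final_centers, groups = kmeans(data, seeds)
    print(len(seeds), [len(g) for g in groups])
```

In a MapReduce setting, the distance computations in both phases would be distributed across mappers, with reducers combining partial results; that distribution step is outside the scope of this sketch.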







Revista de la Facultad de Ingeniería,

ISSN: 2443-4477; ISSN-L: 0798-4065

Edif. del Decanato de la Facultad de Ingeniería,

3º piso, Ciudad Universitaria,

Apartado 50.361, Caracas 1050-A,

Venezuela.

© Universidad Central de Venezuela