Abstract in another language
In recent years, the research field of data mining has advanced significantly. Developments in automatic data collection, very large databases, and data warehouses built from heterogeneous data sources have resulted in very large volumes of data. Analyzing such volumes is not feasible without the efficient, semi-automatic methods of data mining. Recently, new databases have been developed for forms of data more complex than relational data, e.g., customer-transaction, object-oriented, spatial/temporal, and sequence databases, as well as various collections of Web data. The main characteristics of these databases are: (i) the form of their data, which differs significantly from that of relational data, and (ii) their large size, due both to their complex types and to their large volumes. Hence the need for new data mining techniques for these kinds of databases, which constitutes the motivation of the present dissertation. The contribution of the dissertation focuses on the following subjects. In Chapter 2 we examine the problem of mining patterns from models that have a graph-structured representation (for instance, web logs). In such models, users navigate via the links of the graph. We present three algorithms, one level-wise and two non-level-wise. Moreover, we take into account that random accesses (noise) can be interleaved with patterns, and we extend the definition of the mined pattern accordingly. The performance of the algorithms and their sensitivity to several parameters are examined experimentally. In Chapter 3 we propose a new technique for similarity search queries in transaction databases, which have important applications in areas such as recommendation systems. We develop a new representation method and prove that it produces correct results. We also propose new algorithms for processing similarity queries.
Extensive experimental results indicate the superiority of the proposed method. In Chapter 4 we focus on developing methods for storing and searching large collections of sequential patterns, an operation that is useful in post-processing data mining results. We describe a family of algorithms that take into account the ordering of elements within sequential patterns. Moreover, we exploit the fact that the distribution of elements within sequences is skewed to propose a new algorithm for approximating the encoding of sequences. All the proposed algorithms are evaluated experimentally. In Chapter 5 we describe the C2P spatial clustering algorithm. C2P exploits spatial access methods and closest-pair queries. We present extensions for clustering very large spatial databases with noise and clusters of various shapes. Due to its characteristics, C2P combines the advantages of existing algorithms without their deficiencies. Experimental results illustrate its good performance with respect to clustering quality and execution time. In Chapter 6 we examine density-biased sampling techniques. This kind of sampling addresses the deficiencies of uniform sampling in spatial databases that contain samples with skewed sizes. It is useful in the pre-processing step of data mining. We develop a new method that exploits spatial indexes and the density information preserved within them. The proposed method attains improved sampling quality and reduced execution times; experimental results indicate its superiority. Finally, Chapter 7 concludes the dissertation and outlines extensions and directions for future work.