Some new fuzzy query processing methods based on similarity measurement and fuzzy data clustering



INTRODUCTION
Information systems have revolutionized the way complex and diverse information is stored and processed. As a result, the volume of information has increased significantly, leading to information overload. Consequently, it becomes difficult to analyze the large amount of available data and make appropriate management decisions. In practice, information systems mainly use relational databases [1-2] or object-oriented databases (OODB) [3-5] to store these datasets. Both the relational and the object-oriented database models are capable of handling complex objects, but they are limited in describing and handling inaccurate, uncertain, and incomplete information. Accordingly, their query processing is not well suited to decision making. In addition, these systems can only deal with "hard" (precise and deterministic) data. However, many real-world applications involve "soft" (vague and imprecise) data.
Besides, online consulting services have also appeared on web applications through chatbot automated consulting tools [6 -7] by applying artificial intelligence and cloud data to provide information to customers.
Furthermore, robots can communicate with humans using natural language [8]. Data preprocessing is a very important step in data transformation and fuzzification to facilitate interrogation for non-expert users.
Along with the development of fuzzy mathematics, such as probability theory, fuzzy set theory, and similarity relationships [9-12], fuzzy object-oriented and relational database models have been proposed by M. Umano et al. [13], G. Bordogna et al. [14], and Caluwe [15], as well as a probability model proposed by B. Ding et al. [16]. When dealing with complex data stored as objects that may contain inaccuracies and uncertainties, fuzzy query processing proposals can be used. One such proposal, based on fuzzy set theory, employs a membership function to represent quantitative values as linguistic values. To improve query execution efficiency, a technique called horizontal fragmentation [18] is used to reduce the number of hits in fragmentation, so that an efficient query execution method can be selected. Another approach combines the MapReduce computational model with the kdStreamSky fuzzy clustering technique to create a skyline query processing method, which is introduced in ref. [19]. A fuzzy query processing architecture for relational databases was proposed in ref. [20].
The motivation of this paper is to study the fuzzy object-oriented database (FOODB) model and query processing in order to address the limitations of relational databases, crisp OODB, and FOODB in the treatment of uncertain and incomplete information. The contributions of this paper focus on two main issues:
- Data preprocessing: Proposing a technique to evaluate the overall similarity of two objects, based on the Minkowski and Euclidean distance measures.
- Query processing: Proposing four new fuzzy query processing algorithms, namely FQSIMSC, FQSIMMC, FQSEM, and FQINTERVAL. Among them, the three algorithms FQSIMSC, FQSIMMC, and FQSEM use similarity measures based on the SIM and SEM calculations. The fourth algorithm, FQINTERVAL, processes queries directly on fuzzy intervals based on the EMC clustering algorithm.
The paper is structured as follows. Part 2 compares two objects based on the ambiguous similarity measure and data semantics. Part 3 presents fuzzy query processing based on the similarity measure, the clustering algorithm, and the separation of fuzzy intervals. The experiment and conclusion are stated in Parts 4 and 5.

Measurement of ambiguous data semantics
The semantic space of the ambiguous data type is represented by the possibility distribution. The descriptions of the semantic space are represented through semantic relationships. The measures of semantic inclusion and semantic equivalence are usually applied to determine the degree of semantic inclusion [21-22]. Given the universe set { }, with two fuzzy data and defined on the domain U based on the probability of distribution and ( ) , it proves the possibility that is true. The symbol ( ) is defined as follows: According to the above definition, the concepts of similarity can be inferred as follows. Let and be two fuzzy data and ( ) is the degree to which covers the semantics of . ( ) is defined as follows:
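The defining formulas are not reproduced in the text above. As an illustration only, the following sketch uses one commonly cited formulation over a discrete universe: the inclusion degree is the sum of pointwise minima divided by the sum of the covered distribution, and the equivalence degree is the minimum of the two inclusion degrees. Both the formulation and the names `sid` and `se` are assumptions and may differ from the definitions intended here; the sample distributions for "about 21" and "about 22" are hypothetical.

```python
def sid(pi_a, pi_b):
    """Assumed semantic inclusion degree: how far pi_a covers pi_b.

    Both arguments map elements of a discrete universe to possibility
    degrees in [0, 1]; missing elements are treated as degree 0.
    """
    total_b = sum(pi_b.values())
    if total_b == 0:
        return 0.0
    overlap = sum(min(pi_a.get(u, 0.0), degree) for u, degree in pi_b.items())
    return overlap / total_b


def se(pi_a, pi_b):
    """Assumed semantic equivalence degree: the smaller inclusion degree."""
    return min(sid(pi_a, pi_b), sid(pi_b, pi_a))


# Hypothetical possibility distributions for two estimated values.
about_21 = {20: 0.5, 21: 1.0, 22: 0.5}
about_22 = {21: 0.5, 22: 1.0, 23: 0.5}

print(sid(about_21, about_21))  # a distribution fully covers itself
print(se(about_21, about_22))   # partial overlap gives a degree below 1
```

Under this assumed formulation the measure is asymmetric in general, which is why the equivalence degree takes the minimum over both directions.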

Comparison of two objects based on ambiguous similarity measure [27]
Example 2.1. Comparing the similarity between two fuzzy objects. An employee wants to rent an apartment and wants to compare apartments to choose the most suitable one. Each apartment is defined by area, price, and distance to work. Assume that two apartments have been found, as shown in Figure 1. The question is: "How can these two apartments be compared?" The description of the two apartments shown in Figure 1 is ambiguous, since the attribute values are of mixed form, i.e. numeric values and linguistic values [23]. In other words, Apartment 1 and Apartment 2 are ambiguous objects of the fuzzy apartment class (each ambiguous object has at least one ambiguous attribute value).
The following methods help us calculate the similarity between two ambiguous objects: compare two ambiguous properties, compare a crisp attribute with an ambiguous attribute and vice versa, compare two objects that are instances of the same class, and compare two objects that are instances of two different classes.

Compare two fuzzy attributes
In this section, to solve case I, we compare objects with fuzzy properties. We first determine the similarity of two fuzzy objects through their ambiguous properties and then compute the general similarity of the two ambiguous objects using formulas (9) and (10).
Let two objects be and , the corresponding sets of attribute sets are as follows:
The similarity [ ] between two attributes corresponding to is defined as follows: where j, are the attribute with j = 1, 2, …, n, with n being the number of attributes, and the distance measure d is denoted by representing [ ] [ ] as follows: where the corresponding attribute values of with is being the representative fuzzy subset for the value of the attribute belonging to the domain . is represented as follows: The distance ( ) ( ) [ ] The difference of fuzzy sets can be determined as follows: If the properties are linguistic values and their semantics are determined using the membership function ( ) ( ) with every , to compare two apartments (see Figure 3 in example 1), then:
The similarity definition proposed in equation (3) allows us to evaluate the degree of similarity between the properties of two instances. In equation (3), the distance d is adjusted for matching based on the parameter . The similarity measure ( ) between two fuzzy objects and is: where the mapping [ ] [ ] is an aggregate operation such as the minimum function or the weighted mean:
1. The weighted average of the similarity scores of the attributes:
2. The minimum of the similarities of the attributes:
Using the membership function shown in Figure 3, the similarity calculation between these attributes can be measured by:
Due to the characteristics of the object-oriented database model and the density of the data distribution of this model, applying the bell and trapezoidal membership functions to represent quantitative values in qualitative form is the most effective. Moreover, calculation and data processing on the bell and trapezoidal membership functions are more efficient than on other membership functions, owing to the symmetry of the bell membership function and the general ease of use of the trapezoid. Therefore, these two types of membership functions are chosen as the theoretical basis for analyzing this problem.
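The steps described above (a membership function, a distance between two fuzzy attribute values, and an aggregate over attributes) can be sketched as follows. Because the equations are not reproduced in the text, this is a minimal sketch under stated assumptions: the similarity of two attributes is taken as one minus a normalised Minkowski distance between membership degrees sampled over the attribute domain, and the object similarity is either the weighted mean or the minimum of the attribute similarities. The function names, the normalisation, and the trapezoid parameters are hypothetical.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership degree with support [a, d] and core [b, c]."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def minkowski_similarity(mu_a, mu_b, p=2):
    """Assumed attribute similarity: 1 - normalised Minkowski distance.

    mu_a and mu_b are equally long lists of membership degrees sampled
    at the same domain points; p = 2 gives the Euclidean case.
    """
    n = len(mu_a)
    dist = (sum(abs(x - y) ** p for x, y in zip(mu_a, mu_b)) / n) ** (1.0 / p)
    return 1.0 - dist


def aggregate(similarities, weights=None, mode="mean"):
    """Combine attribute similarities by weighted mean or by minimum."""
    if mode == "min":
        return min(similarities)
    if weights is None:
        weights = [1.0] * len(similarities)
    return sum(w * s for w, s in zip(weights, similarities)) / sum(weights)


# Hypothetical fuzzy values of the attribute "Area", sampled on one domain.
domain = [40, 60, 80, 100, 120]
medium = [trapezoid(x, 50, 70, 90, 110) for x in domain]
large = [trapezoid(x, 90, 110, 130, 150) for x in domain]

area_sim = minkowski_similarity(medium, large, p=2)
overall = aggregate([area_sim, 0.8], weights=[2.0, 1.0])  # 0.8: assumed price similarity
```

The minimum aggregate is the stricter choice, since a single dissimilar attribute bounds the overall similarity, whereas the weighted mean lets important attributes dominate.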

Compare the similarity of two objects of the same class
Let class C have the following properties: { }, and let two objects and be of class C with the same set of attributes. Assuming that ( ) and ( ) are ambiguous values, to compare the two ambiguous instances and and compute ( ), we first compare their respective attributes. For each pair of values of the same attribute ( ( )) we need to calculate their degree of equivalence, represented by Finally, we have [28]: ) ( )
We propose fuzzy query processing methods for the two cases described in Figure 3. For case 1, we compare two objects based on similarity measures (comparing two fuzzy attributes, and comparing the similarity of two objects of the same class and of different classes) to develop three algorithms for fuzzy query processing: FQSIMSC, FQSIMMC, and FQSEM. For case 2, we implement a fuzzy interval division algorithm based on the results of the improved clustering algorithm EMC. From the results obtained, we develop a clustered fuzzy query processing algorithm named FQINTERVAL.

Query Processing Based on ambiguous Similarity Measure
Based on the fuzzy class model and fuzzy graph, we build a FOQL fuzzy query processing structure with the following general form. Here, <Query condition> is a fuzzy condition and all thresholds are numbers in [0;1]. By using such a FOQL query, one can extract the objects that belong to the class under the given threshold while satisfying the query condition under the given threshold. Note that the THOLD threshold entry can be omitted. In this case the default threshold is exactly 1.
Case 1: Processing queries for objects with crisp and fuzzy attribute values. In this case, we rely on the DIS distance measure and the SIM similarity calculation, using various membership functions to convert the fuzzy attribute values contained in the database and the fuzzy values supplied by the user in the conditional clause into membership function values. For example, the attribute "Area" with the fuzzy value "Medium" is converted to a membership function value of 0.348, and the attribute "Price" with the crisp value "840$" is converted to a membership function value of 0.7 with the fuzzy interval value "Dearness". We build queries with single-condition and multi-condition clauses, represented by the following algorithms:
1. Fuzzy query processing for the single-condition case has the following form: SELECT ..... FROM C WHERE A_attr θ fvalue THOLD fthreshold. The condition is A_attr θ fvalue, where fvalue is the fuzzy value, A_attr is the fuzzy attribute of the fuzzy class C, θ ∈ {=, ≠, <, >}, and fthreshold ∈ [0;1]. Use the membership function in Figure 4 to convert the fvalue to a membership value.
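The single-condition form can be illustrated with a short sketch. The pseudocode of the corresponding algorithm is not reproduced in the text, so this is not that algorithm: it only assumes that each stored attribute value is converted to a membership degree for the queried fuzzy value, and that an object is returned when this degree is at least fthreshold. The membership function `large_area`, its breakpoints, and the sample objects are hypothetical, and only the "=" comparison is covered.

```python
def large_area(area_m2):
    """Hypothetical membership function for the fuzzy value 'Large'.

    Rises linearly from 0 at 80 square metres to 1 at 120 square metres.
    """
    if area_m2 <= 80:
        return 0.0
    if area_m2 >= 120:
        return 1.0
    return (area_m2 - 80) / 40.0


def fq_single_condition(objects, attr, membership, fthreshold=1.0):
    """Return (FOID, degree) for objects whose degree reaches fthreshold.

    The default threshold of 1.0 mirrors the case where THOLD is omitted.
    """
    selected = []
    for obj in objects:
        degree = membership(obj[attr])
        if degree >= fthreshold:
            selected.append((obj["FOID"], degree))
    return selected


# Hypothetical objects of the apartment class.
apartments = [
    {"FOID": "A1", "Area": 70},
    {"FOID": "A2", "Area": 110},
    {"FOID": "A3", "Area": 125},
]

# Roughly: SELECT FOID FROM C WHERE Area = Large THOLD 0.7
result = fq_single_condition(apartments, "Area", large_area, fthreshold=0.7)
```

Lowering the threshold widens the answer set, which is what makes the query tolerant of vague conditions.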

Case 1: From the data in Table 1, we need to extract information about "FOID, Apartment Type" with the query condition "Area=Large and Price=Regular".
The query has the following form: FOQL1: SELECT C.FOID, C.ApartmentName FROM C WHERE C.Area=Large and Price=Regular THOLD 0.92. The results of the fuzzy query (FOQL1) for case 1 are described in Table 2.
Case 2: From the data in Table 1, we need to extract information about "FOID, Apartment Type", where the data extraction condition of the query is "Price=about $840" or "Area= Large". The structure of the fuzzy query is shown below: FOQL2: SELECT C.FOID, C.[Apartment Type] FROM C WHERE C.Price="about $840" or Area= "Large" THOLD 1/6. The results of the fuzzy query (FOQL2) for case 2 are described in Table 3. Besides the fuzzy query processing methods introduced above, in order to increase the efficiency and flexibility of querying, the paper proposes a fuzzy cluster query method based on the improved clustering algorithm EMC and the fuzzy partitioning algorithm of [26] as the basis for developing the query problem based on fuzzy intervals. The data in Table 4 show that the crisp value of the area attribute is assigned to three fuzzy areas, namely
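For queries with several conditions, the degrees of the individual conditions have to be combined before the threshold is applied. The combination rule is not stated in the text; the sketch below assumes the standard fuzzy-logic choice of the minimum for "and" and the maximum for "or". All names, membership functions, and data are hypothetical.

```python
def ramp_up(lo, hi):
    """Build a membership function rising linearly from 0 at lo to 1 at hi."""
    def mu(x):
        if x <= lo:
            return 0.0
        if x >= hi:
            return 1.0
        return (x - lo) / (hi - lo)
    return mu


def fq_multi_condition(objects, conditions, connective="and", fthreshold=1.0):
    """Return FOIDs whose combined condition degree reaches fthreshold.

    conditions is a list of (attribute name, membership function) pairs.
    Assumption: 'and' combines degrees by min, 'or' combines them by max.
    """
    if connective not in ("and", "or"):
        raise ValueError("connective must be 'and' or 'or'")
    combine = min if connective == "and" else max
    selected = []
    for obj in objects:
        degrees = [mu(obj[attr]) for attr, mu in conditions]
        if combine(degrees) >= fthreshold:
            selected.append(obj["FOID"])
    return selected


# Hypothetical data and fuzzy values.
apartments = [
    {"FOID": "A1", "Area": 70, "Price": 900},
    {"FOID": "A2", "Area": 110, "Price": 840},
    {"FOID": "A3", "Area": 125, "Price": 650},
]
conditions = [("Area", ramp_up(80, 120)), ("Price", ramp_up(600, 900))]

both = fq_multi_condition(apartments, conditions, "and", fthreshold=0.7)
either = fq_multi_condition(apartments, conditions, "or", fthreshold=0.7)
```

With the same threshold, the "or" query can only return the same objects as the "and" query or more, since the maximum of the degrees is never smaller than their minimum.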

From the data in Table 4, we need to extract information about "FOID, Apartment Type" with the query condition "Area is Small".
The query has the following form: FOQL3: SELECT C.FOID, C.ApartmentName, C.Area FROM C WHERE C.Area is 'Small' THOLD 0.92. The results of the fuzzy query (FOQL3) are described in Table 5. By using the linguistic variable of the area property, with the linguistic value Small as the conditional clause, we can obtain a list of corresponding crisp values, that is ( ) . The results of the FOQL3 query of Example 7 show that the time to execute algorithm 5 (FQINTERVAL) for the RoomBooking database is less than that of algorithm 2 (FQSIMMC) and algorithm 3 (FQSEM), used in Examples 5 and 6, respectively. The execution times of algorithm 5 (FQINTERVAL) for the RoomBooking, ProjectManagement and CourseScoresManagement datasets are 3503, 14012 and 14712, respectively (see Table 5). For comparison, the values for the two algorithms FQSIMMC and FQSEM on RoomBooking, ProjectManagement and ManageCourseScores are 2045:2024, 6135:6072 and 6544:6555, respectively (see Figure 5).
The experiments were performed on an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz 2.90 GHz with 16GB RAM, running Windows 10. We found that algorithm 5 (FQINTERVAL) has the lowest execution time. This is because the data extraction is performed directly on the preprocessed fuzzy partitions produced by the EMC clustering algorithm and fuzzy partitioning.
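The idea of querying directly on preprocessed fuzzy partitions can be sketched as follows. This is not the FQINTERVAL listing itself: the EMC clustering step is not implemented, and the interval bounds are simply assumed to be given. The function names, the bounds, and the sample data are hypothetical.

```python
def build_partitions(objects, attr, intervals):
    """Preprocessing step: assign each object to every interval it falls in.

    intervals maps a linguistic label to a (low, high) pair of crisp bounds.
    In the paper these bounds come from EMC clustering and fuzzy
    partitioning; here they are hypothetical inputs.
    """
    partitions = {label: [] for label in intervals}
    for obj in objects:
        for label, (low, high) in intervals.items():
            if low <= obj[attr] <= high:
                partitions[label].append(obj["FOID"])
    return partitions


def fq_interval(partitions, label):
    """Query step: a single lookup on the preprocessed partitions."""
    return partitions.get(label, [])


# Hypothetical apartments and overlapping fuzzy intervals for "Area".
apartments = [
    {"FOID": "A1", "Area": 45},
    {"FOID": "A2", "Area": 55},
    {"FOID": "A3", "Area": 95},
    {"FOID": "A4", "Area": 150},
]
area_intervals = {"Small": (0, 60), "Medium": (50, 100), "Large": (90, 200)}

partitions = build_partitions(apartments, "Area", area_intervals)
small = fq_interval(partitions, "Small")  # roughly: WHERE C.Area is 'Small'
```

In this sketch the cost of scanning the objects is paid once during preprocessing, and each later query only reads one partition, which is consistent with the explanation given for the low execution time.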
According to Table 7, the memory usage of the two algorithms FQSIMMC and FQSEM is larger than that of the FQINTERVAL algorithm. The memory usage of these algorithms for RoomBooking, ProjectManagement and CourseScoresManagement is 896, 2688 and 2867, respectively (see Figure 6). The two algorithms FQSIMMC and FQSEM use more memory because both of them load all data into main memory for processing. In contrast, the FQINTERVAL algorithm uses less memory because the query processing is performed directly on the defined fuzzy intervals only.
Conclusion for the experimental evaluation: From the experimental results on processing time and memory usage, it can be seen that the clustered query processing algorithm FQINTERVAL achieves the best processing time and memory usage compared to the other two algorithms (FQSIMMC, FQSEM). The explanation is that the FQINTERVAL algorithm performs direct data extraction based on fuzzy partitions pre-loaded in main memory and has O(n) complexity. However, the FQINTERVAL algorithm has some limitations due to the lack of set operations (union, intersection). In addition, the two algorithms FQSIMMC and FQSEM have ( ) complexity and require a large memory capacity because they have to read the internal data files from auxiliary memory devices into main memory. However, the advantages of these two algorithms are their continuous data updates and their flexibility when dealing with complex data.

CONCLUSIONS
The article proposes different methods to handle fuzzy queries effectively by applying techniques such as semantic similarity assessment and similarity assessment for crisp and fuzzy data. On this basis, the article proposes four fuzzy query processing algorithms: FQSIMSC for single-condition cases, FQSIMMC for multi-condition cases, FQSEM, and the clustered fuzzy query processing algorithm FQINTERVAL, which is based on the improved clustering algorithm EMC and the fuzzy partitioning method. However, each of these four algorithms has advantages and disadvantages. Therefore, the appropriate algorithm should be chosen depending on the situation, such as the data type, the frequency of data access, and the size of the data, specifically:
-The FQINTERVAL cluster query processing algorithm performs data extraction directly on fuzzy partitions preloaded in main memory and therefore has a fast processing time. However, this algorithm has some limitations due to the lack of set operations (union, intersection).
-The three algorithms FQSIMSC, FQSIMMC and FQSEM have long processing times and large memory requirements because they have to read internal data files from secondary memory devices into main memory. However, the advantage of these three algorithms is that they support continuous data updates and are flexible when dealing with complex data.
The evaluation of the proposed results is performed based on different datasets extracted from the UCI database.

Figure 1. A good example of an ambiguous object comparison.

Figure 2. Fuzzy representation of area and price of two apartments.

Figure 3. The framework of proposed methods for fuzzy query processing.

Figure 4. Description of the query processing algorithm on fuzzy intervals.

The purpose of fuzzy interval classification is to convert quantitative values to linguistic values, with linguistic variables as representative attributes. This makes data extraction by query statements more flexible and natural; specifically, the value of the conditional clause can be a linguistic value such as Small, Medium, or Large. The fuzzy FOQL query algorithm based on fuzzy intervals is implemented as follows.
Algorithm 5: FQINTERVAL (Fuzzy Query Interval).
Input: Let C be the class with attributes { } the set of objects of class { }. The query has the following form: SELECT . . . FROM C WHERE THOLD 1.0
Output: The object set of is satisfied with .
Initialisation: Implement the algorithm to determine the fuzzy intervals; interval[k] contains the fuzzy intervals after implementing EMC,

Figure 6. Evaluations of memory usage for different data sets.

FQSIMSC (Fuzzy Query Sim Single Condition).
Input: Class C with attributes {A_1, A_2, …, A_n}, set of objects of class C: {O_i, i = 1, …, m}, parameter fthreshold and [ ], and K, a positive integer with the default value K = 1.
Output: Object set O_result such that for all t ∈ O_result we have t[A_attr] θ fvalue with a given fthreshold.
Query processing for objects with estimated attribute values and objects of different classes. In this case, we rely on the probability distribution and the semantic similarity calculations SID and SEM. To perform the SID and SEM calculations, we use different suggestions [22-25] to separate the estimated attribute values and the user's value in the conditional clause and convert them to possibility distribution form. For example, the time attribute "about 21" is represented by a

Table 1. Data list of fuzzy objects about apartment (for case 1).

Table 2. Fuzzy query results for case 1.

Table 3. Data list of fuzzy objects about apartment (for case 2).

Table 4. The result of the clustering algorithm EMC and Fuzzy interval.

Table 5. Fuzzy query results based on fuzzy intervals.

Table 6. Execution time in algorithms.

Table 7. Memory usage in algorithms.