How To: Data Mining
- 18th August 2020
- Posted by: Claudine Gabriele
- Category: Articles
Where is the first place to start with data mining? What are the most common challenges with public data? Our bioinformatics team have answered some key questions about data mining below.
How do you start looking for the right type of datasets when data mining?
This can depend on the project. If a client has an interest in a particular database, we can look directly at that database and filter the data according to their specifications. This is similar to the data landscaping method of searching for datasets.
Alternatively, if the aim is to answer a specific question, there are several frequently used public repositories that hold a range of datasets. Each public database usually has its own search engine, so once you have selected which databases are suitable, it is worth searching on each website directly.
What do you look for in a dataset?
Once we have a query in mind, we select keywords, perform a search, and refine it depending on the number of results. After retrieving the datasets, we curate them further for suitability for the planned analysis. As long as we can examine and process the data types, the datasets can potentially be included in the analysis stage. The size of a dataset can also be a factor because of storage requirements.
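The keyword search and refinement described above could be sketched as follows. This is a minimal illustration, not Fios's actual tooling: the record fields ("title", "description") and the function names are hypothetical, and a real repository would be searched through its own search engine rather than an in-memory list.

```python
def matches(record, keywords, require_all=True):
    """Check whether a dataset record mentions the keywords.

    `record` is assumed to be a dict with hypothetical "title" and
    "description" fields; real repositories expose different metadata.
    """
    text = f"{record['title']} {record['description']}".lower()
    hits = [keyword.lower() in text for keyword in keywords]
    return all(hits) if require_all else any(hits)


def search(records, keywords, require_all=True):
    """Return the records that match, keeping their original order."""
    return [r for r in records if matches(r, keywords, require_all)]
```

Starting with require_all=False gives a broad first pass; switching to require_all=True, or adding keywords, narrows the list when the first search returns too many results.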
Public data repositories are regularly updated; depending on a particular database’s own curation criteria, this can affect downstream analyses. Some databases may provide the original data, but also make pre-processed data available for public use.
What are the next steps once you have found suitable datasets?
Once we find suitable datasets, our data mining process follows these four main steps:
1) Download and check data integrity. This is particularly important for large files, because data can become corrupted or be only partially transferred during download.
2) Pre-processing, also known as data-cleaning. This can include checking for data consistency or converting data where necessary for downstream analyses.
3) Perform quality control of the data.
4) Perform downstream analysis or data filtering as per project requirements.
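As an illustration of the integrity check in step 1, the sketch below compares a downloaded file against a published checksum. It assumes the repository publishes a SHA-256 digest alongside the file; some repositories publish MD5 instead, in which case the algorithm argument would change. The function names are hypothetical.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks so that
    large dataset files do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_hex, algorithm="sha256"):
    """Compare the computed digest with the checksum published by the
    repository (assumed to be supplied as a hex string)."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

A mismatch means the file on disk differs from the one the repository published, for example after a truncated transfer, and the download should be repeated before any pre-processing.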
How do you present the data mining findings back to a client?
For quantitative data, the findings could involve quality control and data pre-processing such as normalisation and further downstream analyses. The results of these are then presented in a report through summaries, graphics, and tables. Ultimately, Fios aim to present the findings back to the client in a clear way through processed results and summaries of that data to address the client’s questions. This is what makes each of our reports so bespoke.
What challenges do you normally face with data mining?
Some challenges are simple and relate to the quality of the data. Examples include too few samples, which limits the statistical analysis, or unsuitable datasets that match the same search criteria and crowd out the relevant ones.
Other challenges can be quite specific. Some databases limit the number of requests they accept because their websites receive a lot of traffic. If you need to run a large number of queries, the time taken will add up. However, database owners usually state these limits and recommend the quickest way to query their databases.
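One way to stay within such limits is to space queries out on the client side. The sketch below is a minimal example under stated assumptions: fetch is a hypothetical caller-supplied function that runs one query, and max_per_second is a placeholder that should be set from the limit the database owner actually publishes.

```python
import time


def run_throttled(queries, fetch, max_per_second=3.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call fetch(query) for each query, leaving at least
    1 / max_per_second seconds between the start of consecutive calls.

    `sleep` and `clock` are injectable only to make the function easy
    to test; the defaults use the standard library.
    """
    min_interval = 1.0 / max_per_second
    results = []
    last_call = None
    for query in queries:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(fetch(query))
    return results
```

With this approach, the minimum run time is roughly the number of queries divided by the permitted rate, which is a quick way to estimate how long a large batch will take before starting it.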
Why use data mining?
Data mining can show what is publicly available in your research area. It can also help contextualise a research question: for example, if you are interested in pursuing a certain line of inquiry and a dataset already exists that gives an idea of what to expect, this could factor into the project.
It is also possible to find alternative datasets for comparison if you have generated your own data.
Fios regularly use mined data in projects, combining multi-omics datasets. We can also combine your generated data with public domain data. For more information – and to view one of our public reports utilising public data – get in contact today.