Datasets Repository

Graph Datasets

Several graph datasets from various application domains used in my research are listed here. Each graph is attributed, with labeled nodes and edges. The original datasets are converted into the .g format that can be used as input for GBAD. The .dot file format is also available. If a .dot file is not available, use the graph2dot command in GBAD to convert the .g file to a .dot file for visualization using Graphviz. This will help people understand the relationships between entities within each dataset and facilitate the development of graph-based approaches. The .g file format
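
For readers without GBAD installed, the conversion can be approximated in a few lines. The following is a minimal sketch, not the graph2dot tool itself. It assumes the .g file uses SUBDUE/GBAD-style lines of the form `v <id> "<label>"` for vertices and `e`/`d`/`u <source> <target> "<label>"` for edges, and it ignores every other line; the function name `g_to_dot` is hypothetical.

```python
def _esc(label: str) -> str:
    """Escape characters that would break a double-quoted DOT string."""
    return label.replace("\\", "\\\\").replace('"', '\\"')


def g_to_dot(g_text: str, name: str = "G") -> str:
    """Convert assumed 'v'/'e'/'d'/'u' lines of a .g file into DOT text."""
    out = [f"digraph {name} {{"]
    for raw in g_text.splitlines():
        parts = raw.split(maxsplit=3)
        if len(parts) >= 3 and parts[0] == "v":
            # Vertex line: everything after the id is the (possibly quoted) label.
            label = raw.split(maxsplit=2)[2].strip('"')
            out.append(f'  {parts[1]} [label="{_esc(label)}"];')
        elif len(parts) == 4 and parts[0] in ("e", "d", "u"):
            # Edge line: 'u' is assumed undirected, so the arrowhead is hidden.
            label = parts[3].strip('"')
            extra = ", dir=none" if parts[0] == "u" else ""
            out.append(f'  {parts[1]} -> {parts[2]} [label="{_esc(label)}"{extra}];')
        # Any other line (headers, comments, blanks) is skipped.
    out.append("}")
    return "\n".join(out)


if __name__ == "__main__":
    sample = 'XP # 1\nv 1 "host"\nv 2 "web server"\ne 1 2 "connects"\n'
    print(g_to_dot(sample))
```

The resulting text can be saved to a .dot file and rendered with Graphviz, for example with `dot -Tpng graph.dot -o graph.png`.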

Network DoS Attack Graphs

The dataset used to construct the graph comes from the Visual Analytics Science and Technology (VAST) 2011 Mini Challenge 2. The original dataset consists of firewall logs, IDS logs, syslogs for all hosts on the network, and the network vulnerability scan report of a fictional organization called All Freight Corporation. The graph is constructed using the firewall log from day one, starting at 08:52:52 am (beginning of the day) and ending at 11:50:59 am (11 minutes after initiation of the DoS attack). Our main focus is to detect the onset of the DoS attack. The ground truth of the data indicates that the DoS attack started at 11:39 am and ended at 12:51 pm on day one. The ground truth also reveals that five individual systems on the internet participated in the DoS attack on the external web server, and that the installed IDS log picked up the DoS attack on the network at 11:43:29 am, 3 minutes and 39 seconds after they reported the denial of service attack. We chose this window because we wanted to include enough data to capture the nature of the traffic flow during the initiation of the DoS attack (but not the complete DoS attack traffic), so that we can analyze the effect of the attack on the network, from a graph perspective, in its infancy. It should be noted that the choice of 11 minutes was somewhat arbitrary and not specific to the approach chosen.
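
The time window described above can be expressed as a simple filter over firewall log records. This is a minimal sketch under stated assumptions: the record layout (a dict with a `time` field in HH:MM:SS form) and the function names are hypothetical and do not reflect the actual VAST log schema. Only the two window boundaries come from the description.

```python
from datetime import time

# Window boundaries taken from the description above (day one).
WINDOW_START = time(8, 52, 52)
WINDOW_END = time(11, 50, 59)


def in_window(ts: time) -> bool:
    """Return True if a timestamp falls inside the graph-construction window."""
    return WINDOW_START <= ts <= WINDOW_END


def filter_records(records):
    """Keep only records whose (hypothetical) 'time' field lies in the window."""
    return [r for r in records if in_window(time.fromisoformat(r["time"]))]


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [
        {"time": "08:52:51", "src": "a", "dst": "b"},
        {"time": "11:45:00", "src": "c", "dst": "web"},
        {"time": "11:51:00", "src": "d", "dst": "web"},
    ]
    print(filter_records(sample))
```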

Download DoS Attack Graph/Tool

If you use this dataset, please cite

Paudel, R., Harlan, P., and Eberle, W. Detecting the Onset of a Network Layer DoS Attack with a Graph-Based Approach. Proceedings of the FLAIRS-32, Sarasota, FL (2019)

Smart Homes Activity Graphs

The graphs are constructed from the Kyoto dataset with 400 participants provided by Washington State University’s CASAS program. The CASAS website provides a raw sensor log dataset for each participant containing the time (HH:MM:SS), sensor identification, sensor value, and an activity number indicating which activity is being executed (we have constructed graphs for the first 8 activities). The dataset consists of 8 graphs for each of the 8 activities for 239 healthy patients and 3 patients with cognitive impairment (who can be thought of as anomalies). For more details, please refer to "Anomaly Detection of Elderly Patient Activities in Smart Homes using a Graph-Based Approach".
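
Before any graph is built, the raw sensor log has to be split by activity. The sketch below shows one way to do that. It assumes each log line holds exactly the four whitespace-separated fields listed above (time, sensor ID, sensor value, activity number) and that activities are numbered from 1; both are assumptions about the file layout, and `group_by_activity` is a hypothetical helper, not part of the CASAS tooling.

```python
from collections import defaultdict


def group_by_activity(lines, max_activity=8):
    """Group sensor events by activity number, keeping activities 1..max_activity."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        ts, sensor, value, activity = fields
        if not activity.isdigit():
            continue  # skip lines without a numeric activity label
        act = int(activity)
        if 1 <= act <= max_activity:
            groups[act].append((ts, sensor, value))
    return dict(groups)


if __name__ == "__main__":
    # Made-up log lines for illustration only.
    log = [
        "12:49:52 M17 ON 1",
        "12:49:54 M17 OFF 1",
        "12:51:10 D07 OPEN 2",
    ]
    for activity, events in group_by_activity(log).items():
        print(activity, events)
```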

Download Activity Graph

If you use this dataset, please cite

Paudel, R., Eberle, W., & Holder, L. B. Anomaly Detection of Elderly Patient Activities in Smart Homes using a Graph-Based Approach. Proceedings of the International Conference on Data Science, 163-169 (2018)

Medicare Claim Graphs for Diabetic Patients

The graphs are constructed from the CMS Linkable 2008–2010 Medicare Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) provided by the Centers for Medicare & Medicaid Services (CMS). Out of the 20 random sample files made available by the CMS, subsample 1 is used. We have chosen 2009 beneficiaries from Tennessee and their inpatient, outpatient, carrier, and prescription drug claims when they have an initial diagnosis of diabetes. The graph input file is built from the dataset to reflect the relationships between beneficiaries, their claims, the physicians involved, the service provider institute, the procedures performed, etc. Each beneficiary might have multiple inpatient, outpatient, carrier, or prescription drug claims. The edge between a patient and a claim indicates that the patient filed, or was related to, the corresponding claim. It should also be noted that if a beneficiary has more than one claim, prescription, physician, etc., then a separate claim, prescription, physician, etc., node is created for each unique value, resulting in potentially multiple edges between the patient and these entities. For more details, please refer to "Detection of Anomalous Activity in Diabetic Patients Using Graph-Based Approach".
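
The rule of one node per unique value, with an edge from the patient to each such node, can be illustrated with a small sketch. The record fields (`beneficiary_id`, `claim_id`, `physician_id`), the edge labels, and the function name are hypothetical and chosen for illustration; they are not the actual DE-SynPUF column names or the labels used in the published graphs.

```python
def build_claim_graph(records):
    """Build nodes (one per unique typed value) and patient-to-entity edges."""
    nodes = {}     # (entity type, value) -> integer node id
    edges = set()  # (source id, target id, edge label)

    def node_id(kind, value):
        key = (kind, value)
        if key not in nodes:
            nodes[key] = len(nodes) + 1  # ids are assigned in order of first appearance
        return nodes[key]

    for rec in records:
        patient = node_id("beneficiary", rec["beneficiary_id"])
        claim = node_id("claim", rec["claim_id"])
        physician = node_id("physician", rec["physician_id"])
        edges.add((patient, claim, "filed"))
        edges.add((patient, physician, "physician"))
    return nodes, edges


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [
        {"beneficiary_id": "B1", "claim_id": "C1", "physician_id": "P1"},
        {"beneficiary_id": "B1", "claim_id": "C2", "physician_id": "P1"},
    ]
    nodes, edges = build_claim_graph(sample)
    print(len(nodes), "nodes,", len(edges), "edges")
```

In this example the beneficiary has two claims and one physician, so the graph has four nodes and three edges: the repeated physician value maps to a single node.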

Download Medicare Claim Graph

If you use this dataset, please cite one of the following papers:

Paudel, Ramesh, William Eberle, and Doug Talbert. "Detection of Anomalous Activity in Diabetic Patients Using Graph-Based Approach." Proceedings of the Thirtieth International Florida Artificial Intelligence Research Society Conference (2017).
Rajbhandari, Niraj. Graph Sampling to Detect Anomalies in Large Graphs and Dynamic Graph Streams. Diss. Tennessee Technological University, 2018.

Other Datasets

Twitter-Trending Topic

The dataset consists of tweets and the documents (primarily news stories) mentioned in the tweets. The data was collected using Twitter’s standard search API. We collected tweets related to two trending topics, “FIFA World Cup” and “NATO Summit”, during the summer of 2018. The results from Twitter’s search API contain the tweet text, the Twitter handle name, any hashtags and URLs mentioned in the tweet, and all publicly available information about the user, including their name. The tweet data is a JSON dump of individual tweets. After data collection, we manually inspected the data for the amount of spam present in the dataset. We used the following criteria to label a tweet as spam:

If the tweet has keywords related to the trending topic but the document referred to by the URL does not have any.
If the tweet has multiple links and any of the links refers to a document not related to the trending topic.
If the tweet has a URL that redirects to an unrelated website before redirecting to the related website. This usually occurs when the tweet has a tiny URL.
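
The labeling was done by manual inspection, but the three criteria above can be written down as a rule check to make them concrete. The sketch below is illustrative only. It assumes relatedness is approximated by a case-insensitive keyword match, and that each link is given as a dict holding the text of any intermediate redirect pages (`hops`) and of the final document (`final`); these structures and the function names are hypothetical.

```python
def is_related(text, keywords):
    """Approximate topical relatedness by case-insensitive keyword matching."""
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)


def spam_criterion(tweet_text, links, keywords):
    """Return the number (1, 2 or 3) of the first criterion met, or None."""
    finals = [is_related(link["final"], keywords) for link in links]

    # Criterion 1: on-topic tweet, but no linked document mentions the topic.
    if links and is_related(tweet_text, keywords) and not any(finals):
        return 1
    # Criterion 2: several links, at least one leading to an unrelated document.
    if len(links) > 1 and not all(finals):
        return 2
    # Criterion 3: an unrelated page is visited before the related final page.
    for link, final_ok in zip(links, finals):
        if final_ok and any(not is_related(h, keywords) for h in link["hops"]):
            return 3
    return None


if __name__ == "__main__":
    topic = ["fifa world cup"]
    tweet = "Watch the FIFA World Cup final tonight"
    links = [{"hops": [], "final": "Buy cheap watches online"}]
    print(spam_criterion(tweet, links, topic))
```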

This dataset can be used for anomaly/spam detection in tweets, text mining, etc. It has been used in our research on Spam Tweet Detection in Trending Topic. The dataset has the following types of spam/anomalies in the trending tweets, which are consistent with the spam scenarios listed by Twitter.

Keyword/Hashtag Hijacking
Bogus link
Link piggybacking

The Twitter trending topic datasets can be downloaded here (Twitter-Trending-Topic.zip).

Twitter-Newsfeed Dataset

The dataset is collected from the News API and the Twitter REST API. The News API provides headlines from 70 worldwide sources including ABC News, BBC, Bloomberg, Business Insider, Buzzfeed, Associated Press, CNN, CNBC, ESPN, Google News, etc. (A complete list of all the news sources we collected data from is shown in Appendix 1 of the data documentation.) The Twitter REST API provides tweets and publicly available Twitter handle information for a specified Twitter handle. The data collected in this set consists of news stories from 2/09/2017 to 6/23/2017 and the associated tweets that occurred 10 days before and after the corresponding news story, based upon the Twitter account (handle) mentioned in the body of the news.

How Was the Data Collected?

First, we collected news data from the News API. The data from the News API includes the author name, news title, news headline, news URL, published date, etc. Then, in order to get the body of the news story (which is not returned by the News API), we crawled the URL of the associated news source. Second, if the body of a news article references a Twitter handle, the handle is sent to the Twitter REST API, where all tweets within 10 days of the published news story are collected. The result is two separate comma-delimited (.csv) files, documents.csv and usertweet.csv, corresponding to news stories and tweets respectively.
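
The second collection step (finding handles in a news body and keeping tweets within 10 days of publication) can be sketched as follows. This is a minimal illustration, not the original collection script: the regular expression, the `created_at` field name, and the function names are assumptions, and the actual calls to the News API and Twitter REST API are left out.

```python
import re
from datetime import datetime, timedelta

# Assumed handle pattern: '@' followed by up to 15 word characters,
# not preceded by a word character (which excludes e-mail addresses).
HANDLE_RE = re.compile(r"(?<![\w@])@(\w{1,15})\b")


def extract_handles(body):
    """Return the unique handles mentioned in a news body, lowercased, in order."""
    return list(dict.fromkeys(h.lower() for h in HANDLE_RE.findall(body)))


def tweet_window(published, days=10):
    """Return the (start, end) datetimes of the window around a news story."""
    delta = timedelta(days=days)
    return published - delta, published + delta


def tweets_in_window(tweets, published, days=10):
    """Keep tweets whose (hypothetical) 'created_at' datetime is in the window."""
    start, end = tweet_window(published, days)
    return [t for t in tweets if start <= t["created_at"] <= end]


if __name__ == "__main__":
    body = "Contact press@example.com or follow @NASA and @BBCWorld."
    print(extract_handles(body))
    print(tweet_window(datetime(2017, 3, 1)))
```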

This data can be useful for text/topic mining and is used in Mining Heterogeneous Graph for Patterns and Anomalies. The full Twitter-Newsfeed dataset can be downloaded here (Twitter-Newsfeed.zip).

Ramesh Paudel
