************************************************************************                                           
                               README
                  Elec: labeled and unlabeled data
                           September 2015
    Electronics product review dataset for sentiment classification 
************************************************************************

Contents:
---------
1. Data Description
2. Filename Conventions
3. Contact
4. References

---------------
1. Data Description 
This dataset consists of labeled training data, labeled test data, and 
unlabeled training data.  It was used for the semi-supervised experiments 
on sentiment classification of electronics product reviews in [2].  
 
The data was derived from Electronics.tar.gz at 

  http://snap.stanford.edu/data/web-Amazon.html (also see [3]) 
  
which is a large dataset of Amazon reviews.  

The labels of the labeled sets are balanced so that each labeled set contains 
the same number of positive reviews and negative reviews.  The unlabeled data
follows the natural distribution and therefore it may contain neutral reviews.  
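
The balance of a labeled set can be checked by counting the lines of its 
label file.  The following is a minimal sketch, assuming the labels have 
already been read as strings using the "1"/"2" encoding described in 
Section 2.  The function names label_counts and is_balanced are hypothetical 
and are not part of the distribution.

```python
from collections import Counter


def label_counts(label_lines):
    """Count labels, ignoring surrounding whitespace and blank lines."""
    return Counter(s.strip() for s in label_lines if s.strip())


def is_balanced(label_lines):
    """Return True if only "1" and "2" occur and they occur equally often."""
    counts = label_counts(label_lines)
    # "1" is negative and "2" is positive, per Section 2 of this README.
    return set(counts) <= {"1", "2"} and counts["1"] == counts["2"]
```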

The products reviewed in the test set are disjoint from the products reviewed 
in the labeled training set and the unlabeled training set.  This makes the 
task harder and more realistic, since in practice new products come out after 
training.  

The labeled training set is ordered so that if we divide it into five 
partitions of equal size, the reviewed products in each partition are disjoint
from the rest.  
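
This ordering makes it possible to build cross-validation folds in which no 
product appears in more than one fold.  The following is a minimal sketch, 
assuming that the five partitions are contiguous blocks in file order (the 
text above says only that the set is "ordered").  The function names are 
hypothetical.

```python
def contiguous_partitions(items, k=5):
    """Split items into k contiguous blocks of equal size, in original order."""
    n = len(items)
    if n % k != 0:
        raise ValueError(f"{n} items cannot be split into {k} equal parts")
    size = n // k
    return [items[i * size:(i + 1) * size] for i in range(k)]


def cross_validation_folds(items, k=5):
    """Yield (train, held_out) pairs, holding out one block at a time."""
    parts = contiguous_partitions(items, k)
    for i, held_out in enumerate(parts):
        # Concatenate every block except the held-out one, keeping order.
        train = [x for j, part in enumerate(parts) if j != i for x in part]
        yield train, held_out
```

Shuffling the labeled training set before splitting would discard the 
product-disjoint property, so any shuffling should be done within a block.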

The data includes only the text section of each review.  The summary section 
is not included.

Note: The labeled training set and the test set are also included in 
      elec2.tar.gz, which has been distributed for the supervised experiments
      in [1].


-----------------------
2. Filename Conventions

Labeled training data: elec-25k-train.${ext}
Labeled test data:     elec-test.${ext}

  ext=txt     : Text data before tokenization.  One review per line.
  ext=txt.tok : Tokenized data.  Tokens are separated by blanks. 
                One review per line in the same order as *.txt.
  ext=cat     : Labels.  One label per line in the same order as *.txt. 
                "1": negative, "2": positive. 
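
Because the three file types are line-aligned, reviews and labels can be 
paired by line number.  The following is a minimal sketch of a loader.  The 
function name load_labeled, the data_dir argument, and the UTF-8 encoding 
are assumptions; this README does not state the file encoding.

```python
from pathlib import Path

# Label encoding as documented above: "1" is negative, "2" is positive.
LABEL_NAMES = {"1": "negative", "2": "positive"}


def load_labeled(data_dir, stem, tokenized=False):
    """Return (review, label_name) pairs, e.g. for stem='elec-25k-train'."""
    data_dir = Path(data_dir)
    text_ext = "txt.tok" if tokenized else "txt"
    # Encoding is an assumption; undecodable bytes are replaced, not fatal.
    with open(data_dir / f"{stem}.{text_ext}", encoding="utf-8",
              errors="replace") as f:
        reviews = [line.rstrip("\n") for line in f]
    with open(data_dir / f"{stem}.cat", encoding="utf-8") as f:
        labels = [LABEL_NAMES[line.strip()] for line in f if line.strip()]
    if len(reviews) != len(labels):
        raise ValueError(
            f"{stem}: {len(reviews)} reviews but {len(labels)} labels")
    if tokenized:
        # Tokens are separated by blanks, so whitespace splitting suffices.
        reviews = [r.split() for r in reviews]
    return list(zip(reviews, labels))
```

For example, load_labeled(".", "elec-test", tokenized=True) would return the 
test set as (token list, label name) pairs, provided the files are in the 
current directory.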

Unlabeled training data: elec-25k-unlab00.${ext}
                         elec-25k-unlab01.${ext}
  ext=txt or txt.tok

  Both '00' and '01' were used in the experiments in [2].  Each file contains 
  100K reviews.  '25k' in the filenames indicates that the data is intended 
  for use with the labeled data elec-25k-train.*.  
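
  With 100K reviews per file, it can be convenient to read the unlabeled 
  data one review at a time rather than loading it all at once.  The 
  following is a minimal sketch.  The function name iter_unlabeled, the 
  data_dir argument, and the UTF-8 encoding are assumptions.

```python
from pathlib import Path

# Both unlabeled files, in the order listed above.
UNLABELED_STEMS = ("elec-25k-unlab00", "elec-25k-unlab01")


def iter_unlabeled(data_dir, tokenized=True):
    """Yield one unlabeled review per line from both files, '00' first."""
    ext = "txt.tok" if tokenized else "txt"
    for stem in UNLABELED_STEMS:
        path = Path(data_dir) / f"{stem}.{ext}"
        # Encoding is an assumption; the README does not specify it.
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                yield line.rstrip("\n")
```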


----------
3. Contact 
riejohnson@gmail.com

-------------
4. References
[1] Rie Johnson and Tong Zhang.  Effective use of word order for text 
categorization with convolutional neural networks.  NAACL HLT 2015.

[2] Rie Johnson and Tong Zhang.  Semi-supervised convolutional neural networks
for text categorization via region embedding.  NIPS 2015.

[3] Julian McAuley and Jure Leskovec.  Hidden factors and hidden topics: 
understanding rating dimensions with review text.  RecSys 2013. 
 
