Typed Dependency Parsing using Java
Typed dependency parsers make it possible to extract meaningful terms and concepts from unstructured information.
Unstructured information can be text files, PDFs, or MS Word documents buried on a hard drive, emails on an Exchange server, or even audio streams after they have been converted to text.
We won't spend time going over tree structures and the like; let's dive straight into the good stuff.
Stanford Parser
ConceptExtractorImpl.java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.log4j.Logger;

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;

public class ConceptExtractorImpl {

    Logger logger = Logger.getLogger(ConceptExtractorImpl.class);
    int maxWordCount = 50;
    LexicalizedParser lexicalizedParser = null;
    TreebankLanguagePack treebankLanguagePack = null;
    GrammaticalStructureFactory grammaticalStructureFactory = null;
    String lexicalizedParserFile = "conf/models/standford/englishPCFG.ser.gz";

    public ConceptExtractorImpl() {
        // Load the English PCFG model and prepare the typed dependency factory.
        lexicalizedParser = new LexicalizedParser(lexicalizedParserFile);
        lexicalizedParser.setOptionFlags(new String[] { "-maxLength", "80", "-retainTmpSubcategories" });
        treebankLanguagePack = new PennTreebankLanguagePack();
        grammaticalStructureFactory = treebankLanguagePack.grammaticalStructureFactory();
    }

    // Strips the trailing "-<position>" suffix from a node label, e.g. "lung-8" becomes "lung".
    public String removePosition(String text) {
        int lastDash = text.lastIndexOf('-');
        // Labels without a position suffix are returned unchanged.
        return lastDash < 0 ? text : text.substring(0, lastDash);
    }

    // Keeps only the relations that tend to join words into a noun-phrase-like concept.
    private boolean shouldUse(TypedDependency typedDependency) {
        String relation = typedDependency.reln().getShortName().trim();
        return relation.equalsIgnoreCase("nn")
            || relation.equalsIgnoreCase("prep")
            || relation.equalsIgnoreCase("dep")
            || relation.equalsIgnoreCase("conj_and")
            || relation.equalsIgnoreCase("num")
            || relation.equalsIgnoreCase("amod");
    }

    public List<String> extractConcepts(String sentence) {
        List<String> concepts = new ArrayList<String>();
        Tree tree = (Tree) lexicalizedParser.apply(sentence);
        GrammaticalStructure gs = grammaticalStructureFactory.newGrammaticalStructure(tree);
        Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed(false);
        for (TypedDependency typedDependency : tdl) {
            if (shouldUse(typedDependency)) {
                // Build the phrase as "<dependent> <governor>", both lower-cased and without positions.
                String phrase = removePosition(typedDependency.dep().toString().toLowerCase())
                    + " "
                    + removePosition(typedDependency.gov().toString().toLowerCase());
                concepts.add(phrase);
            }
        }
        return concepts;
    }
}

The above code extracts "meaning" from a sentence. It will take the example sentence:
The woman has cancer in lower left lung.
and extract the following concepts from it:
1. lower lung.
2. left lung.
3. lung cancer.
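To make the filtering step easier to follow without loading the parser model, here is a minimal sketch that works on the printed form of typed dependencies instead of the parser's objects. The class name DependencyFilterSketch and the dependency strings in main are hand-written assumptions for illustration, not captured parser output, and the short-name rule (take the part before the underscore) is likewise an assumption about how a relation such as prep_in is reduced to prep.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Stanford Parser.
class DependencyFilterSketch {

    // Same relation names that shouldUse accepts in ConceptExtractorImpl.
    private static final Set<String> KEPT = new HashSet<String>(
            Arrays.asList("nn", "prep", "dep", "conj_and", "num", "amod"));

    // Matches the printed form "relation(governor-index, dependent-index)".
    private static final Pattern DEPENDENCY =
            Pattern.compile("(\\w+)\\((\\S+)-\\d+, (\\S+)-\\d+\\)");

    static List<String> extractConcepts(List<String> printedDependencies) {
        List<String> concepts = new ArrayList<String>();
        for (String printed : printedDependencies) {
            Matcher m = DEPENDENCY.matcher(printed);
            if (!m.matches()) {
                continue; // skip anything that is not in the expected form
            }
            // Assumption: the short name is the part before "_" ("prep_in" -> "prep").
            String shortName = m.group(1).split("_")[0];
            if (KEPT.contains(shortName)) {
                // Phrase is "<dependent> <governor>", as in the original class.
                concepts.add(m.group(3).toLowerCase() + " " + m.group(2).toLowerCase());
            }
        }
        return concepts;
    }

    public static void main(String[] args) {
        // Hand-written dependencies for the example sentence (illustrative only).
        List<String> dependencies = Arrays.asList(
                "det(woman-2, The-1)",
                "nsubj(has-3, woman-2)",
                "dobj(has-3, cancer-4)",
                "amod(lung-8, lower-6)",
                "amod(lung-8, left-7)",
                "prep_in(cancer-4, lung-8)");
        System.out.println(extractConcepts(dependencies));
    }
}
```

With these hand-written inputs, only the two amod relations and the prep_in relation pass the filter, which is how the three concepts above come about.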
After obtaining these terms and concepts, one usually maps them to a taxonomy.
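As a rough illustration of that mapping step, the sketch below looks each extracted concept up in a small in-memory table. Everything here is hypothetical: the class name TaxonomyMapperSketch, the category labels, and the choice of a plain map are assumptions made for the example; a real taxonomy would normally live in an external vocabulary or database.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps extracted concept phrases to made-up taxonomy labels.
class TaxonomyMapperSketch {

    // Concept phrase -> category label. The labels are invented for illustration.
    private final Map<String, String> taxonomy = new LinkedHashMap<String, String>();

    TaxonomyMapperSketch() {
        taxonomy.put("lung cancer", "Disease/Neoplasm/Lung");
        taxonomy.put("left lung", "Anatomy/Lung/Left");
        taxonomy.put("lower lung", "Anatomy/Lung/Lower");
    }

    // Returns the labels of all concepts found in the taxonomy; unknown concepts are skipped.
    List<String> map(List<String> concepts) {
        List<String> labels = new ArrayList<String>();
        for (String concept : concepts) {
            // Normalize so that "Lung Cancer " and "lung cancer" hit the same entry.
            String key = concept.trim().toLowerCase(Locale.ROOT);
            String label = taxonomy.get(key);
            if (label != null) {
                labels.add(label);
            }
        }
        return labels;
    }
}
```

The list returned by extractConcepts can be passed straight to map; concepts with no entry in the table are dropped rather than guessed at.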
Labels: text analytics, Type Dependency Parsing
1 Comment:
You must use this version of the Stanford Parser:
https://wiki.csc.calpoly.edu/CSC-581-S11-06/export/2/trunk/Stanford/stanford-parser-2011-04-20/stanford-parser.jar
April 30, 2013 at 7:29 PM