How to use the fullSequentialParse method in net.htmlparser.jericho.Source

Best Java code snippets using net.htmlparser.jericho.Source.fullSequentialParse (Showing top 10 results out of 315)

/**
 * Returns a list of all {@linkplain Tag tags} in this source document.
 * <p>
 * Calling this method on the <code>Source</code> object performs a {@linkplain #fullSequentialParse() full sequential parse} automatically.
 * <p>
 * See the {@link Tag} class documentation for more details about the behaviour of this method.
 *
 * @return a list of all {@linkplain Tag tags} in this source document.
 */
public List<Tag> getAllTags() {
  if (allTags==null) fullSequentialParse();
  return allTags;
}

public NodeIterator(final Segment segment) {
  this.segment=segment;
  source=segment.source;
  if (segment==source) source.fullSequentialParse();
  pos=segment.begin;
  nextTag=source.getNextTag(pos);
  if (nextTag!=null && nextTag.begin>=segment.end) nextTag=null;
}

public static void htmlToPlainText(@Nonnull Reader from, @Nonnull Writer to) throws IOException {
  final Source source = new Source(from);
  source.fullSequentialParse();
  final TextExtractor extractor = source.getTextExtractor().setIncludeAttributes(true);
  extractor.writeTo(to);
}

public void appendTo(final Appendable appendable) throws IOException {
  this.appendable=appendable;
  if (segment instanceof Source) ((Source)segment).fullSequentialParse();
  nextTag=segment.source.getNextTag(segment.begin);
  index=segment.begin;
  appendContent(segment.end,segment.getChildElements(),0);
}

sourceHtml.fullSequentialParse();
List<Tag> tags = sourceHtml.getAllTags();

public HtmlContextAnalyser (HttpMessage msg) {
  this.msg = msg;
  this.htmlPage = msg.getResponseHeader().toString() + msg.getResponseBody().toString();
  src = new Source(htmlPage);
  src.fullSequentialParse();
}

trgHtml.fullSequentialParse();
List<Element> trgEntries = trgHtml.getAllElements(HTMLElementName.P);
srcHtml.fullSequentialParse();
List<Element> srcEntries = srcHtml.getAllElements(HTMLElementName.P);

  childElements=Collections.emptyList();
} else {
  if (allTags==null) fullSequentialParse();
  childElements=new ArrayList<Element>();
  int pos=0;

html.fullSequentialParse();

@Override
public String filter(String source, Map<String, Object> properties) {
  Source sourceHtml = new Source(source);
  sourceHtml.setLogger(null);
  sourceHtml.fullSequentialParse();
  OutputDocument outputDocument = new OutputDocument(sourceHtml);
  List<Tag> tags = sourceHtml.getAllTags();
  int pos = 0;
  for (Tag tag : tags) {
    boolean correctAndAllowedTag = processTag(tag, outputDocument);
    if (!correctAndAllowedTag) {
      String elementName = tag.getName().toLowerCase();
      if (removedTags.contains(elementName) || allowedTags.contains(elementName)) {
        outputDocument.remove(tag);
      } else {
        outputDocument.replace(tag, StringEscapeUtils.escapeHtml(tag.toString()));
      }
    }
    reencodeTextSegment(sourceHtml, outputDocument, pos, tag.getBegin());
    pos = tag.getEnd();
  }
  reencodeTextSegment(sourceHtml, outputDocument, pos, sourceHtml.getEnd());
  return correctNewLineSigns(outputDocument.toString(), properties);
}

Javadoc

Parses all of the tags (Tag objects) in this source document sequentially from beginning to end.

Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed.

Calling the #getAllTags(), #getAllStartTags(), #getAllElements(), #getChildElements(), #iterator() or #getNodeIterator() method on the Source object performs a full sequential parse automatically. There are still circumstances where it should be called manually: when it is known that most or all of the tags in the document will need to be parsed, but none of the methods listed above are used, or they are called only after one or more other tag search methods.

If this method is called manually, it should be called soon after the Source object is created, before any tag search methods are called.

By default, tags are parsed only as needed, which is referred to as parse on demand mode. In this mode, every call to a tag search method that is not returning previously cached tags must perform a relatively complex check to determine whether a potential tag is in a valid position (see TagType#isValidPosition(Source,int,int[])).

Generally speaking, a tag is in a valid position if it does not appear inside any other tag. Server tags (see TagType#isServerTag()) can appear anywhere in a document, including inside other tags, so this relates only to non-server tags. Theoretically, checking whether a specified position in the document is enclosed in another tag is only possible if every preceding tag has been parsed. Otherwise it is impossible to tell whether one of the delimiters of the enclosing tag was itself enclosed by some other tag before it, which would invalidate it.

When this method is called, each tag is parsed in sequence starting from the beginning of the document, making it easy to check whether each potential tag is in a valid position. In parse on demand mode a compromise technique must be used for this check, since the theoretical requirement of having parsed all preceding tags is no longer practical. This compromise involves only checking whether the position is enclosed by other tags of certain tag types (see TagType#getTagTypesIgnoringEnclosedMarkup()). The added complexity of this technique makes parsing each tag slower than in a full sequential parse, but when only a few tags need parsing the trade-off is still very much worthwhile.
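The reasoning above can be illustrated with a toy scanner. This is only a sketch under stated assumptions, not Jericho's parser: the class name SequentialTagScan and the method validTagStarts are hypothetical, and the only enclosing construct modelled is an HTML comment. It shows why scanning from the start makes the valid-position check trivial: once a tag has been consumed, the scan resumes after its end, so nothing inside it is ever considered.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: illustrates sequential scanning, not Jericho's implementation. */
final class SequentialTagScan {

    /** Returns the start offset of every '<...>' span that is not enclosed by an earlier tag. */
    static List<Integer> validTagStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int begin = text.indexOf('<', pos);
            if (begin < 0) break;
            int end;
            if (text.startsWith("<!--", begin)) {
                // A comment runs to the first "-->"; any '<' inside it is enclosed, not a tag.
                int close = text.indexOf("-->", begin + 4);
                end = close < 0 ? text.length() : close + 3;
            } else {
                // Any other tag runs to the next '>'.
                int close = text.indexOf('>', begin + 1);
                end = close < 0 ? text.length() : close + 1;
            }
            starts.add(begin);
            // Resume after this tag, so enclosed markup is skipped without any extra check.
            pos = end;
        }
        return starts;
    }
}
```

For the input "<p><!-- <b> --><i>", validTagStarts returns [0, 3, 15]: the <b> at offset 8 sits inside the comment and is never reported. A parse on demand lookup that started at offset 8 would have no such context, which is why it needs the more expensive enclosure check described above.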

The documentation of the TagType#isValidPosition(Source, int pos, int[] fullSequentialParseData) method, which is called internally by the parser to perform the valid position check, includes a more detailed explanation of the differences between the two modes of operation.

Calling this method a second or subsequent time has no effect.

This method returns the same list of tags as the Source#getAllTags() method, but as an array instead of a list.

If this method is called after any of the tag search methods have been called, the tag cache (see #getCacheDebugInfo()) is cleared of any previously found tags before being restocked via the full sequential parse, and the following message is logged at info level (Logger#info(String)): "Full sequential parse clearing all tags from cache. Consider calling Source.fullSequentialParse() manually immediately after construction of Source."

This means that if you still hold references to tags or elements from before the full sequential parse, they will not be the same objects as those returned by tag search methods after the full sequential parse, which can cause confusion if you are allocating user data (Tag#setUserData(Object)) to tags. It is also significant if the Segment#ignoreWhenParsing() method has been called since the tags were first found, as any tags inside the ignored segments will no longer be returned by any of the tag search methods.
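The identity problem described above can be modelled in a few lines. The sketch below is a toy stand-in and makes no claim about how Jericho's cache is implemented: ToyTagCache, ToyTag, tagAt and fullParse are all hypothetical names. It only shows the general effect of clearing a cache and restocking it with new objects.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: a cache that is cleared and restocked, as a late full parse would do. */
final class ToyTagCache {

    /** Hypothetical stand-in for a tag that can carry user data. */
    static final class ToyTag {
        final int begin;
        Object userData;
        ToyTag(int begin) { this.begin = begin; }
    }

    private final Map<Integer, ToyTag> cache = new HashMap<>();

    /** Parse on demand: create the tag on first request, then return the cached object. */
    ToyTag tagAt(int begin) {
        return cache.computeIfAbsent(begin, ToyTag::new);
    }

    /** Stand-in for a late full parse: discard every cached tag and restock with new objects. */
    void fullParse(int... begins) {
        cache.clear();
        for (int b : begins) cache.put(b, new ToyTag(b));
    }
}
```

In this model, a ToyTag obtained from tagAt(0) before fullParse(0, 15) is a different object from the one tagAt(0) returns afterwards, so user data set on the first is not visible on the second. Running the full parse first, before any lookups, avoids the mismatch.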

See also the Tag class documentation for more general details about how tags are parsed.

Popular methods of Source

<init>
getAllElements
getChildElements
Returns a list of the top-level Element in the document element hierarchy. The objects in the list a
getRow
Returns the row number of the specified character position in the source document.
setLogger
Sets the Logger that handles log messages. Specifying a null argument disables logging completely fo
subSequence
Returns a new character sequence that is a subsequence of this source document.
getAllStartTags
getNextEndTag
Returns the EndTag of the specified EndTagType beginning at or immediately following the specified p
toString
Returns the source text as a String.
getAllTags
Returns a list of all Tag in this source document. Calling this method on the Source object performs
getEnd
getFirstElement
